Distributed User Profiling via Spectral Methods 



Dan-Cristian Tomozei
EPFL, Switzerland
dan-cristian.tomozei@epfl.ch

Laurent Massoulie
Technicolor, France
laurent.massoulie@technicolor.com



Abstract 



User profiling is a useful primitive for constructing personalised services, such as content recom- 
mendation. In the present paper we investigate the feasibility of user profiling in a distributed setting, 
with no central authority and only local information exchanges between users. We compute a profile 
vector for each user (i.e., a low-dimensional vector that characterises her taste) via spectral transforma- 
tion of observed user-produced ratings for items. Our two main contributions follow: 

i) We consider a low-rank probabilistic model of user taste. More specifically, we consider that
users and items are partitioned into a constant number of classes, such that users and items within
the same class are statistically identical. We prove that without prior knowledge of the
compositions of the classes, based solely on a few random observed ratings (namely $O(N \log N)$ such
ratings for $N$ users), we can predict user preference with high probability for unrated items by
running a local vote among users with similar profile vectors. In addition, we provide empirical
evaluations characterising the way in which spectral profiling performance depends on the
dimension of the profile space. Such evaluations are performed on a data set of real user ratings
provided by Netflix.

ii) We develop distributed algorithms which provably achieve an embedding of users into a low- 
dimensional space, based on spectral transformation. These involve simple message passing 
among users, and provably converge to the desired embedding. Our method essentially relies on 
a novel combination of gossiping and the algorithm proposed by Oja and Karhunen. 

Keywords: Spectral Decomposition, Random Matrix, Message Passing, Distributed Spectral Embedding,
Distributed Recommendation System



1 Introduction 



Recommendation systems have attracted much interest lately, mostly because of their relevance to core 
businesses of several major companies (e.g. Amazon, Netflix, Yahoo) who offer large catalogues of prod- 
ucts to a vast user base. While the advertisement of highly popular items is straightforward, a significant 
portion of business stems from sales of only mildly popular items. The latter cannot be advertised
indiscriminately, and must be recommended to the "right" users, through targeted recommendations. Such
companies have large storage and computational resources at their disposal, which enable a centralised
computation of recommendations.

In this paper we take a different perspective on the problem of recommendation. Namely, we aim to 
develop strategies suited to distributed operation, where the burden of recommendation is not offloaded 
to the server, but is rather shared among the users. More specifically, we propose the following two-stage 
approach for generating recommendations: 

• In the first stage, distributed algorithms assign coordinates (or profiles) to the users within a certain 
profile space, such that proximity in this space translates to proximity of user taste for content. We 
say that such algorithms perform user profiling. 

• In the second stage, recommendations are obtained via simple and distributed algorithms which 
rely on the primitive of user profiling. We thereby avoid the need for complex machine learning 
techniques. 

The performance of such an approach will depend heavily on the properties of the considered embedding 
of users in the profile space. For this reason, the focus in this paper is on the first stage of the process, 
i.e., user profiling. Namely, we argue that spectral profiling techniques retrieve hidden structure. 

The techniques employed in a centralised setting for generating content recommendation are widely 
known under the generic name of "collaborative filtering". They are typically implemented by a provider 
who wishes to offer a recommendation service to a large customer base. In such a setting, the information 
requested from the customers (or users) is typically related both to their identity (via the registration 
procedure) and to their taste (via the opinions they express regarding the items). 

It is not clear to what extent identity information characterises user taste. Moreover, the nature of
such information gives rise to privacy concerns. In contrast, the opinions that users express about
items constitute the truly relevant data for solving the problem. For this reason, we advocate a purely
agnostic approach to recommending content, which does not use information about the real identity of
users, or the nature of content.

Opinions are expressed in the form of ratings assigned by a user to the items she has already
purchased. Ratings characterise the satisfaction of a user with respect to a specific item. They are discrete
and range from a lowest to a highest value (e.g., number of stars). In particular, the mere fact that a user
has or has not consumed a specific item can be regarded as a binary rating. In this paper we consider the
latter form of rating.

Since the number of items on offer from the provider is overwhelmingly large, the vast majority of 
users only consume a small fraction of items. Hence, for a typical user, only a small number of ratings 
are known. In the case of binary ratings, if an item has not been consumed it does not necessarily follow 
that the user dislikes it. It is possible that the user is simply unaware of the item's existence. Hence, in 
this case we cannot distinguish between missing ratings and disliked items. 

A recent illustration of the possible machine learning techniques and of the corresponding
performance comes from the Netflix prize competition [12]. The goal of the competition was to design an
algorithm that, when trained on a data set made publicly available by Netflix, would manage to improve
prediction accuracy by 10% (measured via Root Mean Squared Error, RMSE) compared to the proprietary
Cinematch algorithm. The designers of such an algorithm [9] were awarded a prize of $1M three years after
the start of the competition. It is important to note that the last two of the three years were spent trying to
improve the gain in prediction from 8.42% to 10.04%.

The strenuous advancement of the Netflix prize suggests the existence of an important obstacle in the
way of achieving high prediction accuracy. A possible explanation is given in the study conducted by
Amatriain et al. [1], which showed that when presented with a movie title several times, users provide
inconsistent ratings. Hence, there exists an implicit noise in any collection of ratings due to the fact that
human taste is variable and not easily quantified. The authors of [1] used the RMSE to characterise the
distance between two sets of ratings assigned by the same users to the same movies. They found RMSE
values comparable to the winning entry of the Netflix prize. These results suggest that user preference
has a significant random component. As such, it is reasonable to consider a probabilistic model of user
taste. Throughout this paper we make the following

Assumption 1. The taste of each user is characterised by a certain probability distribution defined on 
the set of all possible ratings for the set of items. A user's observed ratings are obtained by sampling her 
corresponding multi-dimensional distribution. 

Denote the set of users by $\mathcal{U}$ and the set of items by $\mathcal{F}$. A natural representation for the observed
binary rating information is as follows. Consider a rectangular matrix $S \in \{0,1\}^{|\mathcal{U}| \times |\mathcal{F}|}$, which we call
the rating matrix. Each row corresponds to a user and each column corresponds to an item. An entry $S_{ui}$
corresponds to the user-item pair $(u, i) \in \mathcal{U} \times \mathcal{F}$. The entry holds 1 if user $u$ has purchased item $i$, and
0 otherwise. As previously stated, most of the zero entries in the rating matrix correspond to cases in which
a user has not considered purchasing a specific item. According to Assumption 1, each row of matrix $S$
is a partially observed realisation of a $\{0, 1\}$-valued random $|\mathcal{F}|$-dimensional vector. Relying only on matrix $S$,
we need to assign profiles to the users, such that users with similar taste have similar profiles.

The instances of the problem we consider are extremely large: it is not uncommon to have a user base
$\mathcal{U}$ of the order of millions and a catalogue of items $\mathcal{F}$ of the order of tens of thousands. In the case of
binary ratings, simply representing the probability distribution in Assumption 1 for a single user requires
an exponential amount of memory, $2^{|\mathcal{F}|}$. We are thus constrained to consider very simple approximations
of such probability laws, for the sake of computational tractability.

Like most proposed models of user taste found in the literature (e.g., [7]), we consider a low-rank
model of user taste. More specifically, we make the following

Assumption 2. Each entry $S_{ui}$ of the rating matrix is given by a Bernoulli random variable of parameter
$\bar{S}_{ui}$. The matrix $\bar{S} = (\bar{S}_{ui})$ has rank $K$, where $K \ll |\mathcal{U}|$ and $K \ll |\mathcal{F}|$.

Hence, we consider a low-rank probabilistic model of user taste, as opposed to a deterministic one [7].
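As an illustration of Assumption 2, the following minimal sketch draws a binary rating matrix whose entries are independent Bernoulli variables with a rank-2 parameter matrix. The factor matrices, the sizes and the helper name `sample_ratings` are hypothetical choices made for this example only.

```python
import numpy as np

def sample_ratings(S_bar, rng):
    """Draw a binary rating matrix with independent Bernoulli(S_bar[u, i]) entries."""
    return (rng.random(S_bar.shape) < S_bar).astype(np.int8)

# Hypothetical toy parameters: 6 users, 5 items, rank K = 2.
user_factors = np.array([[0.9, 0.1]] * 3 + [[0.2, 0.8]] * 3)   # 6 x 2
item_factors = np.array([[0.8, 0.7, 0.1, 0.2, 0.5],
                         [0.1, 0.2, 0.9, 0.8, 0.5]])            # 2 x 5
S_bar = user_factors @ item_factors                              # 6 x 5, entries in [0, 1]

rng = np.random.default_rng(0)
S = sample_ratings(S_bar, rng)
```

The observed matrix `S` is generally of full rank; only its entrywise expectation `S_bar` is low-rank.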

In Section 3 we propose a user profiling technique based on the Singular Value Decomposition of
matrix $S$. For a probabilistic model of user taste satisfying Assumption 2 and under further weak
statistical assumptions, we prove that a simple voting scheme among users with similar profiles manages
to produce accurate recommendations for most of the items with high probability. Furthermore, we use
actual movie ratings to compute the profiles of anonymous users of the Netflix system. We observe that
users with similar profiles have similar taste in movies.

Motivated by the ability of spectral techniques to recover hidden structure, in the second part
of this paper (Section 4), we design a distributed algorithm that computes individual spectral profiles
based on local exchanges among users. We prove almost sure convergence of the algorithm and provide
evaluations on a synthetic trace. We conclude in Section 5.

2 Related Work 

Keshavan et al. [7] consider the problem of low-rank matrix completion. They show that for a constant
rank $r = O(1)$ "well-behaved" matrix, it is sufficient to have $\Omega(N \log N)$ revealed entries in order to be
able to achieve exact matrix reconstruction. For a square matrix, this corresponds to an average degree
of $\Omega(\log N)$, just as in our result. The drawback to a direct application of this result to the user taste
prediction problem is the fact that the sought matrix is deterministic. Implicitly, it would mean that user
ratings are deterministic, and that rating matrices are low rank. In contrast, the models we consider are
probabilistic.

In Section 3 we consider a low-rank probabilistic model of user taste. Users are partitioned into
classes, such that users within the same class are statistically identical. We show (Theorems 1 and 2)
that the profile vectors corresponding to each user computed via spectral methods are clustered around
distinct points corresponding to the classes. In this respect, our results are related to spectral clustering.
There is a vast literature on the topic of spectral clustering, of which we now give a brief overview.

Among the most relevant work, Ng et al. [13] propose a clustering algorithm based on spectral
decomposition. Our results provide a comprehensive analysis by giving conditions under which the underlying
partition into classes is retrieved exactly.

In [4], Dasgupta et al. propose an algorithm based on iterative splitting of groups into two subgroups.
In contrast, we obtain the desired groups in one go. In [10], McSherry proposes a different clustering
method based on projections onto the column space of the original matrix.

In both [4] and [10], the probabilistic model of the original matrix is akin to ours (which extends
the classical "planted partition" model). In contrast, we establish our results under far less stringent
conditions on the average degree of the original matrix. Namely, we require an average degree of order
$\Omega(\log N)$ while they require an order of $\Omega(\log^6 N)$.

The recent paper [15] by Shi et al. discusses rationales for choosing which eigenvectors to use when
performing spectral clustering. This issue is to a large extent complementary to the ones we address in
this paper. We could rely on [15] to specify which eigenvectors to keep in our profiling context.

In Section 4.1 we propose a method for computing the eigenvectors of the adjacency matrix of a graph
in a distributed manner. A variant of the method was briefly described in [17]. Eigenvector extraction is
the object of Oja's algorithm [14]. This basic algorithm was refined by Borkar and Meyn [2]. None of
these approaches is distributed, however. Our contribution in Section 4.1 consists precisely in augmenting
these methods to make them distributed.

A significant contribution towards computing the top $k$ eigenvectors of a symmetric weighted
adjacency matrix in a distributed fashion was brought by Kempe and McSherry in [6]. The setting is similar
to the one we consider in Section 4.1. The authors give bounds on the required running time of their
algorithm. Due to the fact that we explicitly introduce noise in our iterations, obtaining such bounds is more
difficult in our case. However, while the algorithm in [6] performs an explicit orthonormalisation within
the main loop, which requires a global synchronization point, our approach facilitates an asynchronous
implementation. The normality of the vectors computed using our algorithm is a direct consequence of
Oja's algorithm. Furthermore, the algorithm we analyze in Section 4.1 requires simpler computations at
each of the nodes compared to [5].

A similar approach to the one presented in this paper has been taken in a recent publication [8]. The
authors aim to determine the eigenvectors of a deterministic matrix based on random sparse observations.
They derive useful bounds on convergence time. However, the gossiping stage in their proposed algorithm
is treated as a "black box". We explicitly construct an algorithm that incorporates two stages: gossiping,
performed on a faster time scale, and Oja's method, performed on a slower time scale. Moreover, we
explicitly determine multiple eigenvectors, whereas the authors of [8] focus on determining a single
eigenvector, and argue that the extension can be achieved. Finally, we propose an asynchronous algorithm.
We show on synthetic data that the latter determines the desired eigenvectors.

3 Spectral Recovery of Probabilistic Taste



We begin by analysing a simple setting in which the observations consist of measures of similarity be- 
tween users. We prove that a profiling technique based on the spectral decomposition of the square matrix 
regrouping the observed similarities between pairs of users successfully recovers hidden structure. 

We apply these findings to the case in which ratings of items by users are observed. We show that a 
simple distributed voting algorithm provides asymptotically accurate predictions for most items. 

Finally, we observe the benefits of spectral profiling on a real trace. 

We make use of the notations described in Table 1. Unless otherwise indicated, all vectors are column
vectors.

Symbol                 Meaning
$A'$                   The transposition of matrix $A$
$x'$                   The transposition of column vector $x$
$e$                    The all-ones column vector
$\mathrm{diag}(a)$     The $K \times K$ diagonal matrix having the elements on the main diagonal given by the $K$-dimensional vector $a$
$\|y\|_\alpha$         The $\alpha$-norm of vector $y$, where $0 < \alpha < 1$ and $y$ are column vectors with the same dimension, $\|y\|_\alpha := \sqrt{\sum_k \alpha_k y_k^2}$
$y(k)$, $y_k$          The $k$-th element of column vector $y$

Table 1: General notations



3.1 Similarity-based Profiling 

Denote by $N = |\mathcal{U}|$ the number of users. Let us consider in this first stage that we are given partial
observations of user taste similarity in the form of a symmetric matrix

$$A \in \{0,1\}^{N \times N}.$$

Namely, for any two users $u, v$ the elements $A_{uv} = A_{vu}$ take value 1 if users $u$ and $v$ have been evaluated
as similar, and value 0 if the users are deemed dissimilar, or if the similarity between the two users has
not been evaluated. By convention $A_{uu} = 0$.

We propose the following spectral representation of users based on these partial similarity
observations. For some fixed dimension $L$, extract the normalised eigenvectors $x_1, \ldots, x_L$ corresponding to the
$L$ largest magnitude eigenvalues of matrix $A$. We define the profile space as $\mathbb{R}^L$. In the profile space, to
each user $u$ there corresponds a scaled row vector $\sqrt{N}\, z'_u$, where

$$z'_u = (x_1(u), \ldots, x_L(u)).$$

We refer to this vector as the profile of user $u$. The scaling factor $\sqrt{N}$ is introduced to compensate for
the fact that the eigenvectors of the $N \times N$ matrix $A$ are taken of norm 1. In what follows, we propose a
simple probabilistic user taste model for which this spectral representation of users enables us to retrieve
hidden structure.
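The construction of the scaled profiles can be sketched in a few lines of NumPy. This is a centralised illustration only (the distributed computation is the subject of Section 4); the function name `spectral_profiles` and the toy matrix are assumptions of this example.

```python
import numpy as np

def spectral_profiles(A, L):
    """Rows are the scaled profiles sqrt(N) * z'_u built from the L
    largest-magnitude eigenvalues of the symmetric matrix A."""
    N = A.shape[0]
    eigvals, eigvecs = np.linalg.eigh(A)        # orthonormal eigenvectors of a symmetric matrix
    top = np.argsort(-np.abs(eigvals))[:L]      # indices of the L largest |eigenvalue|
    return np.sqrt(N) * eigvecs[:, top]

# Toy similarity matrix: two groups (4 and 3 users), everyone similar within a group.
A = np.zeros((7, 7))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)

profiles = spectral_profiles(A, L=2)
```

In this noiseless toy case all users of a group receive exactly the same profile, and the two groups are mapped to distinct points of the profile space. Note that each eigenvector is only defined up to sign.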

3.1.1 Statistical Model 

We assume the following probabilistic model of user taste: the $N$ users are partitioned into $K$ classes
$C_1, \ldots, C_K$, such that users within the same class are statistically identical. We denote the size of class
$C_k$, $1 \le k \le K$, by $|C_k| = \alpha_k N$, for fixed $\alpha_k > 0$, which are such that $\sum_k \alpha_k = 1$. For any user $u \in \mathcal{U}$
we denote by $k(u)$ her unique corresponding class.

Symbol                 Meaning
$u, v$                 Indices referring to users
$k, \ell$              User profile indices
$k(u)$                 Index of the profile containing user $u$: $u \in C_{k(u)}$
$\{C_k\}_{k=1}^K$      Partition of users into $K$ unknown disjoint profiles
$\alpha$               Vector grouping fractions of users per profile: $|C_k| = \alpha_k N$
$B$                    Constant unknown $K \times K$ matrix of probabilities
$p = \omega/N$         Probing probability
$L$                    Dimension of profile space

Table 2: Notations and conventions



Each pair of classes $1 \le k, \ell \le K$ is characterised by probabilities $b_{k\ell} = b_{\ell k}$ as follows. User $u \in C_k$
and user $v \in C_\ell$ are similar with probability $b_{k\ell}$ and dissimilar with probability $1 - b_{k\ell}$. Moreover, for all
pairs of users, their similarity is observed with probability $p = \omega/N$, where $\omega$ is a parameter of the model.
The similarity of unobserved pairs of users is set to 0 by default.

Equivalently, for all ordered pairs of users $u < v$, the observed similarities $A_{uv} = A_{vu}$ are the
outcomes of independent Bernoulli random variables of parameters $p\, b_{k(u)k(v)}$. As previously stated, the
diagonal elements are all null, $A_{uu} = 0$.
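The statistical model above can be simulated directly. The sketch below samples an observed similarity matrix; the helper name `sample_similarity` and the numerical values of `B`, `omega` and the class sizes are hypothetical.

```python
import numpy as np

def sample_similarity(classes, B, omega, rng):
    """Sample the symmetric 0/1 matrix A: for u < v, A[u, v] ~ Bernoulli(p * B[k(u), k(v)])
    with p = omega / N, and A[u, u] = 0."""
    N = len(classes)
    probs = (omega / N) * B[np.ix_(classes, classes)]   # N x N matrix of Bernoulli parameters
    upper = np.triu(rng.random((N, N)) < probs, k=1)    # keep independent draws for u < v only
    return (upper | upper.T).astype(np.int8)            # symmetrise; the diagonal stays 0

# Hypothetical parameters: K = 2 classes with fractions (0.5, 0.5), N = 200 users.
classes = np.repeat([0, 1], 100)
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
A = sample_similarity(classes, B, omega=40.0, rng=np.random.default_rng(1))
```

Here `classes[u]` plays the role of $k(u)$; it is used only to generate the data and is not available to the profiling procedure.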

Given the observed similarity matrix $A$, without knowledge of the $K \times K$ profile similarity matrix
$B = (b_{k\ell})_{k,\ell}$, we wish to recover the partition of users in the unknown classes $C_k$ using their spectral
representation. Below we provide sufficient scaling assumptions for which such recovery is possible.

In Table 2 we summarise the notations we have introduced so far.

3.1.2 Scaling Assumptions 

Let us describe the dependence of the various parameters of the model on the number of users. 

We have made Assumption 2, namely that our model is low-rank. Thus, we consider that the
number of classes $K$, as well as the fraction of users in each class $\alpha_k$, and the similarity probability
matrix $B = (b_{k\ell})_{k,\ell}$ are constant as the number of users in the system grows.

However, we assume that the probing probability $p$ vanishes as the number of users grows. Namely,
$\omega$ goes to infinity slower than $N$,

$$\omega \to \infty \ \text{as} \ N \to \infty, \qquad \omega = o(N). \qquad (1)$$

Hence, each user probes on average a fraction $p = \omega/N \to 0$ of users with whom she evaluates taste
similarity. Equivalently, matrix $A$ can be regarded as the adjacency matrix of a random graph on the set
of users having average degree of order $\Theta(\omega)$.

3.1.3 Hidden Structure Recovery 

Theorem 1 below states that under mild conditions, for large $N$, the spectral representations $\sqrt{N}\, z'_u$ of
users are clustered according to their respective classes $C_k$. Consider the vector

$$\alpha := (\alpha_k)_{k=1}^{K}.$$

The $\alpha$-norm of a $K$-dimensional vector $t$ is $\|t\|_\alpha^2 = \sum_{k=1}^{K} \alpha_k t_k^2$ (see Table 1). Define the following constant
matrix

$$M := (b_{k\ell}\, \alpha_\ell)_{1 \le k, \ell \le K}.$$

Before stating the theorem, we introduce the following conditions:

The dimension of the profile space is upper bounded by $L \le \mathrm{rank}(M)$. (2a)
The $L$ largest magnitude eigenvalues of $M$ have distinct absolute values. (2b)
The corresponding eigenvectors normalised under the $\alpha$-norm, $y_1, \ldots, y_L$, satisfy:

$$t'_k \ne t'_\ell, \quad 1 \le k < \ell \le K,$$

where $t'_k := (y_1(k), \ldots, y_L(k))$. (2c)
$\omega \ge C \log N$, for some absolute constant $C$. (2d)

Theorem 1. Under assumptions (2), with probability $1 - o(1)$, a fraction of $1 - o(1)$ users $u$ is such that
$\|\sqrt{N}\, z'_u - t'_{k(u)}\| = o(1)$, where $k(u)$ denotes the class of $u$. That is to say, most users have their (scaled)
profile vector close to a fixed vector characteristic of their class.

Each vector $t'_k$ corresponds to a class $C_k$. If two classes $k, \ell$ were such that $t_k = t_\ell = t$, the theorem
ensures that the profiles of users of both classes would be grouped around the constant vector $t$ in the
profile space. Hence, it would be impossible to distinguish between the users of the two classes based
solely on their profile vectors. Condition (2c) ensures that

$$\|t'_k - t'_\ell\| = \Omega(1),$$

thus guaranteeing the ability to distinguish between distinct classes for a large enough number of users.

We prove that eigenvectors corresponding to non-zero eigenvalues of matrix $A$ can be used to recover
the hidden classes. Condition (2a) ensures selection of a suitable dimension $L$. We impose the technical
condition (2b) for ease of presentation.

Condition (2d) gives a lower bound of order $\log N$ on the required average user neighbourhood
size $\omega$. The theorem states that for $\omega$ growing at least as fast as $\log N$, the initial partition into profiles $C_k$
can be recovered with high probability for almost all users.
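The clustering behaviour described by Theorem 1 can be observed on synthetic data. The following sketch is a small numerical experiment, not a proof; the values of `N`, `omega` and `B` are hypothetical, and the class labels are used only to generate the data and to evaluate the result.

```python
import numpy as np

rng = np.random.default_rng(2)
N, omega = 600, 120.0                       # hypothetical sizes; p = omega / N = 0.2
classes = np.repeat([0, 1], N // 2)         # two equal classes, alpha = (0.5, 0.5)
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Sample the observed similarity matrix A (symmetric, zero diagonal).
probs = (omega / N) * B[np.ix_(classes, classes)]
upper = np.triu(rng.random((N, N)) < probs, k=1)
A = (upper | upper.T).astype(float)

# Scaled spectral profiles from the L = 2 largest-magnitude eigenvalues.
eigvals, eigvecs = np.linalg.eigh(A)
top = np.argsort(-np.abs(eigvals))[:2]
profiles = np.sqrt(N) * eigvecs[:, top]

# Compare the spread inside each class with the distance between class centroids.
centroids = np.array([profiles[classes == k].mean(axis=0) for k in (0, 1)])
spread = np.linalg.norm(profiles - centroids[classes], axis=1).mean()
separation = np.linalg.norm(centroids[0] - centroids[1])
print(f"mean within-class spread: {spread:.3f}, centroid separation: {separation:.3f}")
```

For these parameters the within-class spread should be small compared with the distance between the two class centroids, which is what makes a local vote among nearby profiles meaningful.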

Remark 1. Concerning condition (2d), it is plausible that an even lower requirement for $\omega$ (i.e. constant)
suffices. Such a case has been explored in the context of bounding the second eigenvalue of the adjacency
matrix of a sparse random graph to $O(\sqrt{\omega})$, by removing high degree outlier nodes from the graph
(see El). 

Remark 2. Other flavours of matrices than the adjacency matrix A could be considered for spectral anal- 
ysis (e.g. Laplacian matrix, normalised Laplacian matrix). We do not address either of these scenarios 
in the present work. 

We now give the main steps in the proof of Theorem 1. The auxiliary lemmas are proved in the
appendix. Consider the matrix

$$\bar{A} := (p\, b_{k(u)k(v)})_{u,v},$$

which, according to our model, is equal to the expectation $\mathbb{E}A$ of the partially observed similarity matrix
$A$, up to the diagonal entries (recall that $A_{uu} = 0$). We can write $A = \bar{A} + Q$, where $Q := A - \bar{A}$.

The theorem relies on the fact that the block matrix $\bar{A}$ imposes the eigenvalues and eigenvectors of $A$,
while the perturbation matrix $Q$ has little influence therein, as follows from the lemmas below:

Lemma 1. The top $L$ largest magnitude eigenvalues of $\bar{A}$ have distinct absolute values and are of order
$\Theta(\omega)$. The normalised eigenvectors $(\bar{x}_\ell)_{\ell=1}^{L}$ corresponding to these eigenvalues are constant on indices
corresponding to each user class. Specifically, using the $y_\ell$ defined in (2c), we can write

$$\bar{x}_\ell(u) = \frac{y_\ell(k(u))}{\sqrt{N}}, \quad 1 \le \ell \le L. \qquad (3)$$

Use the following ordering of the eigenvalues of $A$ and $\bar{A}$:

$$|\lambda_1| \ge |\lambda_2| \ge \ldots \ge |\lambda_L|, \qquad (4)$$
$$|\bar{\lambda}_1| \ge |\bar{\lambda}_2| \ge \ldots \ge |\bar{\lambda}_L|.$$

We also denote by $x_k$ and $\bar{x}_k$ the corresponding normalised eigenvectors.
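The block structure claimed in Lemma 1 can be checked numerically: vectors that are constant on each class, with values given by the $\alpha$-normalised eigenvectors of $M = B\,\mathrm{diag}(\alpha)$ divided by $\sqrt{N}$, are unit-norm eigenvectors of the block matrix, with eigenvalues $\omega$ times those of $M$. The parameter values below are hypothetical.

```python
import numpy as np

# Hypothetical model parameters (K = 3 classes).
alpha = np.array([0.5, 0.3, 0.2])
B = np.array([[0.9, 0.2, 0.1],
              [0.2, 0.7, 0.3],
              [0.1, 0.3, 0.8]])
N, omega = 100, 20.0
classes = np.repeat(np.arange(3), np.rint(alpha * N).astype(int))

# Eigen-decomposition of M = B diag(alpha) through the similar symmetric matrix
# diag(sqrt(alpha)) B diag(sqrt(alpha)); y = w / sqrt(alpha) then has unit alpha-norm.
root = np.sqrt(alpha)
mu, W = np.linalg.eigh(root[:, None] * B * root[None, :])
Y = W / root[:, None]                       # columns y_l: eigenvectors of M, alpha-normalised

A_bar = (omega / N) * B[np.ix_(classes, classes)]
X_bar = Y[classes, :] / np.sqrt(N)          # x_bar_l(u) = y_l(k(u)) / sqrt(N)

# Each column of X_bar is a unit eigenvector of A_bar with eigenvalue omega * mu_l.
residual = np.abs(A_bar @ X_bar - X_bar * (omega * mu)).max()
```

Passing through the symmetric matrix is a design choice of this sketch: it guarantees real eigenvalues and makes the $\alpha$-normalisation automatic.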
To control the influence of the perturbation matrix $Q$ we use the following

Lemma 2. Consider a square $N \times N$ symmetric 0-diagonal random matrix $A$ such that its elements
$A_{ij} = A_{ji}$ are independent Bernoulli random variables with parameters $\mathbb{E}A_{ij} = p_{ij}\, \omega N^{-1}$, where the
$p_{ij}$ are constant and $\omega = \Omega(\log N)$. Then with high probability the spectral radius of the matrix $A - \mathbb{E}A$
satisfies the upper bound $\rho(A - \mathbb{E}A) \le O(\sqrt{\omega})$.

Denote by $D = \mathrm{diag}(b_{k(u)k(u)}\, \frac{\omega}{N},\ 1 \le u \le N)$. By application of Lemma 2 to matrix $A$, we get
that the spectral radius of $Q_0 := A - \mathbb{E}A$ is upper bounded by $O(\sqrt{\omega})$ with high probability. Since
$\rho(D) \le O(\frac{\omega}{N})$, we have that the spectral radius of $Q = Q_0 - D$ is also upper bounded by $O(\sqrt{\omega})$ with
high probability.
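The scaling in Lemma 2 can be illustrated numerically. The sketch below treats the simplest case where all parameters are equal (a single constant `b`), which is an assumption of this example, and compares the spectral radius of the centred matrix with $\sqrt{\omega}$.

```python
import numpy as np

def centred_spectral_radius(N, omega, b, rng):
    """Spectral radius of A - E[A] for a symmetric zero-diagonal matrix with
    independent Bernoulli(b * omega / N) entries above the diagonal."""
    q = b * omega / N
    upper = np.triu(rng.random((N, N)) < q, k=1)
    A = (upper | upper.T).astype(float)
    mean = q * (np.ones((N, N)) - np.eye(N))     # E[A]: q off the diagonal, 0 on it
    return np.abs(np.linalg.eigvalsh(A - mean)).max()

rng = np.random.default_rng(3)
N, omega = 500, 50.0                             # hypothetical values
rho = centred_spectral_radius(N, omega, b=0.8, rng=rng)
ratio = rho / np.sqrt(omega)
```

The ratio `rho / sqrt(omega)` is expected to stay of order one, whereas the leading eigenvalues of the expected matrix grow linearly in `omega`.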

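The previous two lemmas are instrumental in proving that $A$ and $\bar{A}$ have the same spectral structure:

Lemma 3. Using the ordering (4) for the eigenvalues of $A$ and $\bar{A}$, it holds that for all $1 \le k \le K$

$$\big|\, |\lambda_k| - |\bar{\lambda}_k| \,\big| \le O(\sqrt{\omega}) \quad \text{whp}, \qquad (5)$$
$$\sin(x_k, \bar{x}_k) \le O(\omega^{-1/4}) \quad \text{whp}. \qquad (6)$$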

We conclude by an application of Chebyshev's inequality. Lemma 3 shows that with high probability
we have $\|x_\ell - \bar{x}_\ell\|^2 \le O(\omega^{-1/2})$, for all $1 \le \ell \le L$. Condition (2c) guarantees that $\|t'_k - t'_\ell\| = \Omega(1)$.

By Chebyshev's inequality, for any constant $a > 0$, and for a user $U$ chosen uniformly at random,
denoting by $k(U)$ its class, it holds that:

$$\mathbb{P}\left(\|\sqrt{N}\, z'_U - t'_{k(U)}\| \ge a\right) \le \frac{1}{a^2} \cdot \frac{1}{N} \sum_{u=1}^{N} \sum_{\ell=1}^{L} N \left(x_\ell(u) - \bar{x}_\ell(u)\right)^2 = O(a^{-2} \omega^{-1/2}).$$

Thus we will be able to conclude the result of the theorem if we can find an $a$ such that $a = o(1)$ and
$a^{-2} \omega^{-1/2} = o(1)$. It is easy to see that for instance $a = \omega^{-1/6}$ satisfies these conditions. □

3.2 Application: Extension to Content Recommendation 

Let us now consider a seemingly different scenario. Denote again by $N = |\mathcal{U}|$ the number of users and
by $F = |\mathcal{F}|$ the number of items. Assume without loss of generality that $N \ge F$. We can write $F = \gamma N$,
with $0 < \gamma \le 1$. We consider the rectangular observed rating matrix

$$S \in \{0,1\}^{N \times F},$$

where $S_{ui} = 1$ if user $u$ has rated and liked item $i$ and 0 otherwise. If an entry $S_{ui}$ is null, it is not
necessarily true that user $u$ dislikes item $i$ (the item might have simply not been rated).

We propose the following representation of users based on this collected information. For some
dimension $L$, extract the $L$ normalised left singular vectors $x_1, \ldots, x_L$ of $S$, corresponding to its $L$
largest singular values. Like in the previous subsection, consider the $L$-dimensional profile space $\mathbb{R}^L$, in
which we associate a scaled row vector $\sqrt{N}\, z'_u$ to user $u$, where

$$z'_u = (x_1(u), \ldots, x_L(u)).$$

The scaling by $\sqrt{N}$ is again due to the fact that the singular vectors are normalised.

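Recall that the Singular Value Decomposition (SVD) of a rectangular real matrix $S = X \Sigma Y'$ is always
well defined, where matrices $X \in \mathbb{R}^{N \times F}$ and $Y \in \mathbb{R}^{F \times F}$ have orthonormal columns and matrix
$\Sigma \in \mathbb{R}^{F \times F}$ is diagonal with non-negative entries.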
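In the rating setting, the scaled profiles are read off the left singular vectors. The following is a minimal centralised sketch using NumPy's reduced SVD; the function name `svd_profiles` and the toy rating matrix are assumptions of this example.

```python
import numpy as np

def svd_profiles(S, L):
    """Rows are sqrt(N) * z'_u, built from the left singular vectors of S
    associated with its L largest singular values."""
    N = S.shape[0]
    X, sigma, Yt = np.linalg.svd(S, full_matrices=False)   # sigma is sorted in decreasing order
    return np.sqrt(N) * X[:, :L], sigma[:L]

# Toy rating matrix (hypothetical): users 0-2 like items 0-1, users 3-5 like items 2-4.
S = np.zeros((6, 5))
S[:3, :2] = 1.0
S[3:, 2:] = 1.0
profiles, sigma = svd_profiles(S, L=2)
```

With `full_matrices=False` the matrix of left singular vectors has shape $N \times F$, matching the decomposition recalled above.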

Again, we claim that users with similar taste will be mapped to close-by locations in the profile 
space. More specifically, we show that local voting in the profile space provides users with accurate 
recommendations for most of the items. In what follows, we give a probabilistic model of user taste and 
scaling assumptions for which we prove these claims. 

3.2.1 Statistical Model 

We consider a probabilistic model of user taste similar to the one in Section 3.1.1. The $N$ users are
partitioned into $K$ disjoint classes $C_1, \ldots, C_K$ and the $F$ items are partitioned into $K'$ disjoint classes
$D_1, \ldots, D_{K'}$. Users and items in the same class are statistically identical. The size of user class $C_k$ is
denoted by

$$|C_k| = \alpha_k N, \quad 1 \le k \le K,$$

while the size of item class $D_{k'}$ is denoted by

$$|D_{k'}| = \beta_{k'} F, \quad 1 \le k' \le K'.$$

The $(\alpha_k)_k$ and $(\beta_{k'})_{k'}$ are strictly positive and sum to 1:

$$\sum_k \alpha_k = \sum_{k'} \beta_{k'} = 1, \quad \alpha_k > 0, \quad \beta_{k'} > 0.$$

For any user $u \in \mathcal{U}$ we denote by $k(u)$ her unique corresponding user class, and for any item $i \in \mathcal{F}$ we
denote by $k'(i)$ its unique corresponding item class.

Pairs of user and item classes $(C_k, D_{k'})$, $1 \le k \le K$, $1 \le k' \le K'$, are characterised by probabilities
$r_{kk'}$ in the following way: any user $u \in C_k$ likes any item $i \in D_{k'}$ with probability $r_{kk'}$ and dislikes
it with probability $1 - r_{kk'}$. For all user-item pairs $(u \in \mathcal{U}, i \in \mathcal{F})$, $u$ decides to rate $i$ with probability
$p = \omega/N$. Equivalently, any element $S_{ui}$ of the observed rating matrix $S$ is obtained by drawing a
Bernoulli random variable of parameter $p \cdot r_{k(u)k'(i)}$.

Given the observed rating matrix $S$, without knowledge of the $K \times K'$ affinity matrix $R := (r_{kk'})_{k,k'}$,
we wish to recover the partition of users in the $K$ classes $C_k$ by making use of their spectral
representation. We provide sufficient scaling conditions in what follows.

We summarise the notations we use in Table 3.

3.2.2 Scaling Assumptions 

We make again Assumption 2 and take the number of classes of users $K$ and of items $K'$ to be constant
with $N$. We assume that the fraction of users in each user class $\alpha_k$, the fraction of items in each item class
$\beta_{k'}$, the ratio between the number of items and the number of users $\gamma = F/N$, as well as the class affinity
matrix $R = (r_{kk'})_{k,k'}$ are constant with respect to $N$. Thus, the class sizes, as well as the total number
of items grow linearly with $N$.

We assume that the rating probability $p$ vanishes as the number of users $N$ grows to infinity.
Specifically, we again impose condition (1) on parameter $\omega$. Note that the expected number of items rated by a
user is of order $\Theta(\omega)$.

Let us now formulate and prove an extension of Theorem 1 that we can apply in this setting.

Symbol                        Meaning
$N$                           Number of users
$F = \gamma N$                Number of items
$u, v$                        Indices referring to users
$i, j$                        Indices referring to items
$k, \ell$                     User profile indices
$k', \ell'$                   Item class indices
$k'(i)$                       Index of the class containing item $i$: $i \in D_{k'(i)}$
$\{C_k\}_{k=1}^{K}$           Partition of users into $K$ disjoint classes
$\{D_{k'}\}_{k'=1}^{K'}$      Partition of items into $K'$ disjoint classes
$\alpha$                      Vector grouping fractions of users per class: $|C_k| = \alpha_k N$
$\beta$                       Vector grouping fractions of items per class: $|D_{k'}| = \beta_{k'} F$
$R$                           Constant unknown $K \times K'$ affinity matrix

Table 3: Preferred indices



3.2.3 Content Recommendation 

Consider vectors

$$\alpha = (\alpha_k)_{k=1}^{K}, \qquad \beta = (\beta_{k'})_{k'=1}^{K'},$$

and define the following constant square matrix

$$G := R\, \mathrm{diag}(\beta)\, R'\, \mathrm{diag}(\alpha) \in \mathbb{R}^{K \times K}.$$

We impose the following conditions, similar to (2):

The dimension of the profile space is upper bounded by $L \le \mathrm{rank}(G)$. (7a)
The $L$ largest eigenvalues of matrix $G$ are distinct. (7b)
The corresponding eigenvectors $\{g_\ell\}_{\ell=1}^{L}$ normalised for the $\alpha$-norm satisfy:

$$\chi'_k \ne \chi'_\ell, \quad 1 \le k < \ell \le K,$$

where $\chi'_k = (g_1(k), \ldots, g_L(k))$. (7c)
$\omega \ge C \log N$ for some constant $C$. (7d)

We can now state the following

Theorem 2. Under conditions (7), with probability $1 - o(1)$, a fraction of $1 - o(1)$ users $u$ is such that
$\|\sqrt{N}\, z'_u - \chi'_{k(u)}\| = o(1)$. That is to say, most users have their scaled profile vector close to a fixed vector
corresponding to their class.
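The matrix $G$ and the class vectors $\chi'_k$ appearing in Theorem 2 can be computed explicitly once the model parameters are fixed. The sketch below uses hypothetical values for the class fractions and for the affinity matrix.

```python
import numpy as np

# Hypothetical parameters: K = 2 user classes, K' = 3 item classes.
alpha = np.array([0.6, 0.4])                     # user class fractions
beta = np.array([0.5, 0.3, 0.2])                 # item class fractions
R = np.array([[0.9, 0.2, 0.5],
              [0.1, 0.8, 0.5]])                  # K x K' affinity matrix

G = R @ np.diag(beta) @ R.T @ np.diag(alpha)     # K x K

# G is similar to the symmetric matrix diag(sqrt(alpha)) R diag(beta) R' diag(sqrt(alpha)),
# so its eigenvectors can be obtained with eigh and rescaled to unit alpha-norm.
root = np.sqrt(alpha)
sym = root[:, None] * (R @ np.diag(beta) @ R.T) * root[None, :]
lam, W = np.linalg.eigh(sym)
order = np.argsort(-lam)                         # largest eigenvalues first
lam, g = lam[order], (W / root[:, None])[:, order]

L = 2
chi = g[:, :L]                                   # row k is chi'_k = (g_1(k), ..., g_L(k))
```

Condition (7c) then amounts to checking that the rows of `chi` are pairwise distinct.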

Before giving the main points of the proof of Theorem 2, let us explain how this seemingly distinct
setup can be mapped to the previous one. Define the following transformation which produces a square
matrix:

$$S \mapsto A = \begin{pmatrix} 0 & S \\ S' & 0 \end{pmatrix} \in \mathbb{R}^{(N+F) \times (N+F)},$$

where $S'$ denotes the transposition of matrix $S$.

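The spectrum of such a matrix is symmetrical (i.e. if $\sigma$ is an eigenvalue of $A$, then so is $-\sigma$). Furthermore,
the absolute values of the eigenvalues of $A$ are the singular values of $S$ and its singular components
are determined by the eigenvectors of $A$. To see this, consider an eigenvalue $\sigma$ of $A$ and its corresponding
eigenvector $\zeta = \begin{pmatrix} x \\ y \end{pmatrix}$, with $x \in \mathbb{R}^{N \times 1}$ and $y \in \mathbb{R}^{F \times 1}$. Since $A\zeta = \sigma\zeta$, we can write:

$$A \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} Sy \\ S'x \end{pmatrix} = \sigma \begin{pmatrix} x \\ y \end{pmatrix} \quad \text{and} \quad A \begin{pmatrix} x \\ -y \end{pmatrix} = \begin{pmatrix} -Sy \\ S'x \end{pmatrix} = -\sigma \begin{pmatrix} x \\ -y \end{pmatrix},$$

and thus $x'Sy = y'S'x = \sigma\|x\|^2 = \sigma\|y\|^2$. Then, we have that $x$ is a left singular vector, and that $y$ is a
right singular vector for matrix $S$. For more details see for instance [16].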
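The correspondence between the eigenvalues of the symmetrised matrix and the singular values of $S$ can be verified numerically on a small random instance. The sizes and the rating density below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
N, F = 6, 4                                       # hypothetical small sizes, N >= F
S = (rng.random((N, F)) < 0.5).astype(float)

# Symmetric (N + F) x (N + F) matrix with S and S' as off-diagonal blocks.
A = np.block([[np.zeros((N, N)), S],
              [S.T, np.zeros((F, F))]])

eigvals = np.linalg.eigvalsh(A)                   # ascending order
sigma = np.linalg.svd(S, compute_uv=False)        # descending order, length F

largest = eigvals[::-1][:F]                       # the F largest eigenvalues of A
```

The `F` largest eigenvalues of `A` coincide with the singular values of `S`, the `F` smallest are their opposites, and the remaining `N - F` eigenvalues are zero.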
We can now prove Theorem 2. We introduce the following notation:

$\bar{S} := (p\, r_{k(u)k'(i)})_{u,i} = \mathbb{E}S.$

The following lemmas characterise the structure of the singular decomposition of $\bar{S}$ and $S$. They show that the two matrices have the same spectral structure. The proofs are found in the appendix.

Lemma 4. For $L \le K'$, the top $L$ largest singular values of $\bar{S}$ are distinct and of order $\Theta(\omega)$. The normalised left-singular vectors $(\bar{x}_\ell)_{\ell=1}^{L}$ corresponding to these singular values are constant on indices corresponding to each user class. Specifically, using the $g_\ell$ defined in (7c), we can write

$\bar{x}_\ell(u) = \frac{1}{\sqrt{N}}\, g_\ell(k(u)).$

We denote the singular values of $S$ and $\bar{S}$ by

$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_L \ge \cdots$ (8)
$\bar{\sigma}_1 > \bar{\sigma}_2 > \cdots > \bar{\sigma}_L > 0$ (9)

and the corresponding left and right normalised singular vectors by $x_k$, $y_{k'}$, $\bar{x}_k$, and $\bar{y}_{k'}$.

Lemma 5. For all $1 \le k \le K$,

$|\sigma_k - \bar{\sigma}_k| \le O(\sqrt{\omega})$ whp, (10)
$\sin(x_k, \bar{x}_k) \le O(\omega^{-1/4})$ whp. (11)



To conclude, let us use a similar argument to the one in the proof of Theorem 1. Lemma 5 shows that with high probability we have $\|\bar{x}_\ell - x_\ell\|^2 \le O(\omega^{-1/2})$ for all $1 \le \ell \le L$. Condition (7c) guarantees that $\|\chi_k - \chi_\ell\| > 0$ for $k \ne \ell$.

Again, by Chebyshev's inequality, for any constant $a > 0$, and for a user $U$ chosen uniformly at random, it holds that:

$\mathbb{P}\left(\|\sqrt{N} z'_U - \chi_{k(U)}\| \ge a\right) \le O(a^{-2}\omega^{-1/2}).$ (12)

We conclude the proof by choosing again $a = \omega^{-1/6}$. □



3.2.4 Characterising Performance of a Simple Voting Algorithm 

Let us now analyse a simple recommendation algorithm that relies on local voting in the profile space. We have defined for each user $u$ a scaled $L$-dimensional profile vector, which we denoted by $\sqrt{N} z'_u$. In Section 4 we propose a method for computing such profile vectors in a distributed fashion based solely on local information exchange.

Say we wish to characterise the taste of user $u$ for item $i$. Consider a fixed constant $d > 0$. We define the $d$-vicinity of $u$ as the set of users that have profile vectors at a Euclidean distance of at most $d$ from $\sqrt{N} z'_u$ in the profile space. More formally,

$V_d(u) := \left\{v \in \mathcal{U} : \|z_u - z_v\|_2 \le \frac{d}{\sqrt{N}}\right\}.$

A vote is run among users in $V_d(u)$ to collect their appreciation for item $i$. The estimated preference $\hat{p}_{u,i}$ of user $u$ for item $i$ is defined as the proportion of users in $V_d(u)$ that have given a rating of 1 to $i$, that is

$\hat{p}_{u,i} = \frac{1}{|V_d(u)|} \sum_{v \in V_d(u)} S_{v,i}.$ (13)
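The voting rule can be illustrated in a few lines. The following is a minimal sketch, assuming NumPy, a dense array `Z` of unscaled profile vectors and a dense 0-1 rating array `S` (the function name `vote_estimate` and the four-user toy data are hypothetical). It also assumes that user $u$ belongs to her own vicinity, which is what the definition of $V_d(u)$ suggests.

```python
import numpy as np

def vote_estimate(Z, S, u, i, d):
    """Estimate the preference of user u for item i by a vote in V_d(u).

    Z: (N, L) array whose rows are the unscaled profile vectors z_u.
    S: (N, F) 0-1 array of observed ratings.
    """
    N = Z.shape[0]
    dist = np.linalg.norm(Z - Z[u], axis=1)
    # V_d(u): users within distance d / sqrt(N) of z_u (u itself included).
    vicinity = np.flatnonzero(dist <= d / np.sqrt(N))
    # Proportion of users in the vicinity having given a rating of 1 to i.
    return S[vicinity, i].mean()

# Toy example: two well separated groups of two users each.
Z = np.array([[0.0, 0.0], [0.01, 0.0], [1.0, 1.0], [1.0, 1.01]])
S = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])
estimate_same_class = vote_estimate(Z, S, u=0, i=0, d=0.1)
estimate_other_class = vote_estimate(Z, S, u=2, i=0, d=0.1)
```

In this toy setting only users of the same group take part in each vote, which is the situation that Proposition 1 shows to hold for most users when $d$ is small enough.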

The following proposition guarantees that, for well chosen $d$ and as the number of users grows to infinity, the voting algorithm above produces accurate recommendations for most users and most items.

Proposition 1. Define $\delta := \min_{1 \le k < \ell \le K} \|\chi_k - \chi_\ell\|_2$, where the $\chi_k$ are defined in (7c). If $d < \delta/2$, then with high probability, for a fraction $(1 - o(1))$ of users $u$ and a fraction $(1 - o(1))$ of items $i$, the estimator $\hat{p}_{u,i}$ defined in (13) obtained via the voting procedure is of the form

$\hat{p}_{u,i} = \Theta\left(\frac{\omega}{N}\right)(r_{kk'} + o(1)).$



Proof. For any user $u$ denote the distance from her scaled profile vector to the vector $\chi_{k(u)}$ corresponding to her class by $d_0(u) := \|\sqrt{N} z'_u - \chi_{k(u)}\|_2$. For some positive $D$, we call $D$-good users those users $u$ having $d_0(u) \le D$.

A simple application of (12) for constant $D$ gives

$\mathbb{P}(d_0(U) \ge D) \le O(\omega^{-1/2}),$

where $U$ is a user taken uniformly at random. Denote by $Y(D)$ the number of users that have their scaled profile vector at distance larger than $D$ from the vector $\chi_k$ corresponding to their class. Then it follows that

$\mathbb{E}\,Y(D) \le O\left(\frac{N}{\sqrt{\omega}}\right) = o(N).$

Thus there exists a fraction $(1 - o(1))$ of $D$-good users.

Let $\epsilon$ be a fixed constant such that $0 < \epsilon < d$. We consider $(d - \epsilon)$-good users $u$ of class $k$, that is with their scaled profile vectors at distance at most $d - \epsilon$ of $\chi_k$ (i.e., $d_0(u) \le d - \epsilon$).

User $u \in C_k$ wishes to obtain a recommendation for a certain item $i$ and runs a vote among the users in $V_d(u)$ (i.e., within the ball of radius $d$ centred at her profile vector) as described above. This ball contains the set of $\epsilon$-good users of class $k$, which we denote by $V_\epsilon^k := \{v \in C_k : d_0(v) \le \epsilon\}$. Since $\epsilon$ is a constant, the number of users outside of $\bigcup_\ell V_\epsilon^\ell$ (i.e., that are not $\epsilon$-good) is at most $o(N)$. We can write

$|V_\epsilon^k| = |C_k| - |\bar{V}_\epsilon^k| = \alpha_k N + o(N),$

where $\bar{V}_\epsilon^k$ denotes the set of users of class $k$ that are not $\epsilon$-good (i.e., $\bar{V}_\epsilon^k = C_k \setminus V_\epsilon^k$).

For any $(d - \epsilon)$-good user $u$ of class $k$, by construction we have that $V_\epsilon^\ell \cap V_d(u) = \emptyset$ for all $\ell \ne k$. Thus, the number of users in $V_d(u)$ which are not of class $k$ is upper-bounded by

$|V_d(u) \setminus C_k| \le \left|\bigcup_{\ell \ne k} \bar{V}_\epsilon^\ell\right| \le o(N).$

Moreover, since $V_\epsilon^k \subseteq V_d(u)$,

$|V_d(u) \cap C_k| \ge |V_\epsilon^k| = \alpha_k N + o(N).$

Thus, most of the users of class $C_k$ are present in $V_d(u)$, and at most $O(N/\sqrt{\omega})$ outliers pollute the voting scheme. In what follows we show that the influence of the outliers vanishes asymptotically.

Denote by $W$ the set of items having received more than $f$ votes from the $O(N/\sqrt{\omega})$ outliers in $V_d(u)$. Then we have that with high probability

$|W| \le O\left(\frac{N\sqrt{\omega}}{f}\right).$

Take $f = o(\omega)$, but such that $|W| = o(N)$. For example, pick $f = \omega^{3/4}$.

For any item $i \notin W$ (there is a fraction $(1 - o(1))$ of such items), denote its class by $k' = k'(i)$. Then we can write

$\hat{p}_{u,i} = \frac{1}{|V_d(u)|} \left( \sum_{v \in V_d(u) \cap C_k} X_v + O(f) \right),$

where the $X_v$ are independent Bernoulli random variables identically distributed with parameter $p\, r_{kk'} = \frac{\omega}{N} r_{kk'}$. They correspond to the votes of users in the same class $C_k$ as $u$ that belong to $V_d(u)$. Since the number of such users is at least $\alpha_k N - O(N/\sqrt{\omega}) = \Theta(\alpha_k N)$, by application of a Chernoff bound we obtain that with high probability

$\hat{p}_{u,i} = \Theta\left(\frac{\omega}{N}\right)(r_{kk'} + o(1)),$

which concludes the proof. □



3.3 Spectral Profiling in Practice 

In this section we evaluate the benefits of spectral techniques on a real trace provided by Netflix. The Netflix data set contains about $10^8$ user ratings for 17,770 movies by 480,000 users. The ratings are given in the form of an integer number of stars, ranging from 1 to 5.

We evaluate taste similarity between users that are assigned close-by profiles in the spectral embedding. We do this as follows: we select a set of 2000 users and a set of 2000 movies from the Netflix data set, such that the selected users have given roughly the same number of ratings within the selected movie set. The presence of a rating is viewed in this setting as a sign that the user has viewed that particular movie, and is therefore considered as a binary form of appreciation (the lack of a rating denoting a potential lack of interest for that content). Subsequently, for each user we hide the rating of one content at random. Using the remaining observed ratings, we build a sparse observed 0-1 rating matrix $S$, and we compute the spectral profiles $\sqrt{N} z_u$ of the users. For each user, an ordered list of neighbours is implicitly defined, from the closest to the farthest one in the profile space. We compute the average frequency of occurrence of the following event: "a user at distance $k$ has rated the hidden content". The average is taken over the set of users. This "frequency of agreement" reflects the taste proximity of users.
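The evaluation procedure above can be sketched as follows. This is a minimal illustration on random toy data, assuming NumPy, a dense 0-1 matrix and profiles taken as the top $L$ left singular vectors of the observed matrix scaled by $\sqrt{N}$ (the function name `agreement_by_rank`, the toy sizes and the choice of checking agreement against the observed matrix are assumptions, not details confirmed by the text).

```python
import numpy as np

def agreement_by_rank(ratings, L, max_rank, rng):
    """Frequency with which the k-th nearest neighbour rated the hidden item."""
    S = ratings.astype(float).copy()
    N = S.shape[0]
    hidden = np.empty(N, dtype=int)
    for u in range(N):
        rated = np.flatnonzero(S[u])
        hidden[u] = rng.choice(rated)   # hide one rated item at random
        S[u, hidden[u]] = 0.0
    # Spectral profiles: top-L left singular vectors of the observed matrix.
    U, _, _ = np.linalg.svd(S, full_matrices=False)
    Z = np.sqrt(N) * U[:, :L]
    hits = np.zeros(max_rank)
    for u in range(N):
        dist = np.linalg.norm(Z - Z[u], axis=1)
        dist[u] = np.inf                # a user is not her own neighbour
        neighbours = np.argsort(dist)[:max_rank]
        hits += S[neighbours, hidden[u]]
    return hits / N                     # average over users, one entry per rank

rng = np.random.default_rng(3)
ratings = (rng.random((30, 15)) < 0.3).astype(float)
ratings[:, 0] = 1.0   # make sure every toy user has at least one rating
freq = agreement_by_rank(ratings, L=2, max_rank=5, rng=rng)
```

On real data one would use a sparse matrix and a truncated SVD routine instead of the dense decomposition used here for brevity.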

In Figure 1 we plot this "frequency of agreement" for different values of the dimensionality $L$ of the embedding. Content popularity (i.e., the fraction of users having rated it) ranges from 0.45% to 5.3%, and the average popularity of content is 2.13%. We show this value on our plots for comparison. In Figure 1(a) we vary the dimension from 2 to 150 and plot the average frequency of agreement with the nearest neighbour. We notice that there is a peak in this average frequency around roughly 30 dimensions. Subsequently the plotted frequency decays slowly. In Figure 1(b) we plot the average frequency of agreement for 2, 30 and 150 dimensions for the 100 closest neighbours in the profile space. We conclude that an embedding of rank 30 is appropriate to characterise user taste for the selected users in the trace.

Figure 1: Profiling Proximity. (a) Average frequency of agreement with nearest neighbour. (b) Average frequency of agreement with neighbours.

It is important to observe that the frequency decays with distance in the profile space. This indicates 
that spectral profiling manages to capture user taste proximity. 

4 Oja's Algorithm and Beyond 

We now propose a method for extracting the eigenvectors of the adjacency matrix of a graph in a distributed manner. Eigenvector extraction is the object of Oja's algorithm [14], the basic version of which is refined by Borkar and Meyn in [2].

Consider a sequence of symmetric square random matrices $(A_k \in \mathbb{R}^{N \times N})_k$ of common finite mean $\bar{A} \in \mathbb{R}^{N \times N}$. Oja and Karhunen [14] proposed the following stochastic approximation algorithm for determining the $s$ top eigenvectors of $\bar{A}$:

$\tilde{X}_k = X_{k-1} + A_k X_{k-1} \Gamma_k,$ (14)
$X_k = \tilde{X}_k R_k^{-1},$ (15)

where $X_k$ is an $N \times L$ matrix, $\Gamma_k$ is a diagonal matrix of gains and $R_k^{-1}$ is a matrix achieving the orthonormalisation of the columns of $\tilde{X}_k$. They prove almost sure convergence under typical assumptions on the sequence of gains, assuming unit multiplicity of the top $s$ eigenvalues, probability density uniformly bounded away from 0 for each of the $A_k$, and almost sure boundedness and symmetry of the $A_k$, as well as statistical independence.
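The two-step iteration (14)-(15) can be sketched as follows. This is a minimal illustration, assuming NumPy, scalar gains $\Gamma_k = (1/k) I$, a QR factorisation as the orthonormalisation step, and a toy diagonal mean matrix perturbed by symmetric Gaussian noise (all of these are illustrative choices, not the setting analysed by Oja and Karhunen).

```python
import numpy as np

def oja_karhunen(sample, N, L, steps, rng):
    """Iterate (14)-(15) with gains Gamma_k = (1/k) I and QR orthonormalisation."""
    X, _ = np.linalg.qr(rng.standard_normal((N, L)))
    for k in range(1, steps + 1):
        X_tilde = X + (1.0 / k) * sample(rng) @ X   # step (14)
        X, _ = np.linalg.qr(X_tilde)                # step (15)
    return X

# Hypothetical mean matrix: its top-2 eigenspace is spanned by e_1 and e_2.
mean = np.diag([5.0, 3.0, 1.0, 1.0, 0.5, 0.2])

def sample(rng):
    """Draw a symmetric random matrix A_k with mean `mean`."""
    G = 0.1 * rng.standard_normal(mean.shape)
    return mean + (G + G.T) / 2.0

rng = np.random.default_rng(1)
X = oja_karhunen(sample, N=6, L=2, steps=2000, rng=rng)
# Share of the norm of X lying outside the span of e_1 and e_2.
residual = np.linalg.norm(X[2:, :]) / np.linalg.norm(X)
```

The residual should be small after a few thousand steps, which indicates that the columns of `X` approximately span the top eigenspace of the mean matrix.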

A simpler single-step form of the algorithm was also proposed by Oja and Karhunen. The convergence of a slightly modified version thereof was shown by Borkar and Meyn in [2], who introduced the additional factor $\frac{1}{1 + \mathrm{tr}(X'_k X_k)}$ and added the additional i.i.d. $\mathcal{N}(0, I)$ noise sequence $\xi_k$:

$X_k - X_{k-1} = \frac{a_k}{1 + \mathrm{tr}(X'_{k-1} X_{k-1})} \left[ (I - X_{k-1} X'_{k-1}) A_{k-1} X_{k-1} + \xi_k \right].$

Here $(a_k)$ is an almost typical gain sequence:

$\sum_k a_k = \infty, \quad \sum_k a_k^2 < \infty, \quad \sup_k \frac{\sum_{n > k} a_n^2}{a_k} < \infty.$

None of these approaches is distributed, however. Our contribution consists precisely in augmenting these methods to make them distributed.

4.1 A Method for Distributed Spectral Profiling

In Section 3 we have demonstrated the benefits of spectral profiling. We have seen that given a set of $N$ users, it is sufficient that each of them contact on average only $\Omega(\log N)$ other users at random and determine similarity with them to essentially manage to characterise the profiles of everyone, by application of the spectral transformation.

In this section we consider a (sparse) graph obtained via such a probing process. We develop message-passing algorithms that enable all users to individually compute their spectral profile, while allowing communication only between pairs of neighbours. The algorithms we propose in this section only require the connectivity property of the graph (and hence no specific bound on neighbourhood sizes). However, the scenario of a sparse graph is most appealing, as it fits very well with the model we introduced in the previous section.

Let us thus consider a network described by an undirected graph $\mathcal{G} = (\mathcal{U}, \mathcal{E})$, where $\mathcal{U}$ is the set of $N$ nodes (also referred to interchangeably as "users"), and $\mathcal{E}$ is the set of edges connecting these nodes. We denote by $\mathcal{N}_u$ the set of neighbours of $u$, and also write $u \sim v$ to indicate that two nodes $u$, $v$ are neighbours. Nodes that share a link will compute their "similarity" value. The user similarity values computed for peers connected by an edge define an adjacency matrix, which we call $A$. In this section we show that nodes are able to compute individually via message passing their coordinates in an $L$-dimensional profile space. These coordinates form a collection of $L$ linearly independent vectors which span the vector space generated by the $L$ eigenvectors corresponding to the top $L$ largest magnitude eigenvalues of $A$.

In the context of Section 3.1 we can think of the graph $\mathcal{G}$ as a random $G(N, p)$ graph, where $p$ is the probing probability. Since in the model we have introduced the eigenvectors are quasi-constant on each of the user profiles, a linear combination thereof will have the same property. Hence, the results of Theorem 1 hold for such a collection of vectors, for a modified version of condition (2c). More specifically, consider a full rank $L \times L$ matrix $W = (w_{k\ell})$ of linear coefficients, and the matrix $X = (x_1, \ldots, x_L)$ of eigenvectors of the adjacency matrix $A$. Results in Section 3 hold for spectral profile vectors given by $XW$, under the following condition:

The matrix of normalised eigenvectors $Y = (y_1, \ldots, y_L)$ under the $\alpha$-norm of matrix $M = B\,\mathrm{diag}(\alpha)$ satisfies

$t'_k \ne t'_\ell, \quad 1 \le k < \ell \le K,$ (2c')

where $t'_k := (Y w_1(k), \ldots, Y w_L(k))$.

For ease of presentation, in the following we assume that $A$ is positive semidefinite.
We now describe our proposed method. Given some fixed number $L$ of target eigenvectors to be extracted, each user $u$ maintains at all times $t$ three sets of variables:

1. An $L$-dimensional row vector $X_u(t)$, the sought-for eigenvector coordinates;

2. An $L \times L$ matrix $\Phi_u(t)$ and a scalar $\Psi_u(t)$, both playing an auxiliary role in the calculation.

For ease of presentation, we assume slotted time t = 0,1,2,..., and synchronous updates at all peers. 
Asynchronous versions will be described and tested in the next section. 
Our algorithm then takes the following form: 



$X_u(t+1) - X_u(t) = \frac{a(t)}{Y_u(t)} \left[ \sum_{v \sim u} A_{vu} X_v(t) - N X_u(t) \Phi_u(t) + \xi_u(t+1) \right].$ (16)

In the above, $a(t)$ is a gain parameter to be specified, $\xi_u(t+1)$ is a noise term deliberately introduced by user $u$, and the denominator $Y_u(t)$ is taken equal to

F u (0 = max ( U^W.ttto > . ('/))/., | I . (17) 



k,Z=l J 



It is readily seen that the update (16) can be computed locally at $u$, solely relying on variables local to node $u$ and inputs $X_v(t)$ from its neighbours $v \in \mathcal{N}_u$. The same is also true for the updates of variables $\Phi_u$ and $\Psi_u$, which take the forms:

$\Phi_u(t+1) = \Phi_u(t) + b(t) \sum_{v \sim u} (\Phi_v(t) - \Phi_u(t)) + f_u(t+1) - f_u(t),$ (18)

where $b(t)$ is a gain parameter, $f_u(t)$ is an $L \times L$ matrix, specified by

$f_u(t) = X'_u(t) \sum_{v \sim u} A_{uv} X_v(t),$ (19)

and

$\Psi_u(t+1) = \Psi_u(t) + b(t) \sum_{v \sim u} (\Psi_v(t) - \Psi_u(t)) + g_u(t+1) - g_u(t),$ (20)

where $g_u(t)$ is a scalar, specified by

$g_u(t) = X_u(t) X'_u(t).$ (21)

Before stating the main result of this section, we introduce the technical conditions that will be required from the gain sequences $a(t)$, $b(t)$:

$a(t), b(t) \in [0, 1], \quad t \ge 0,$ (22a)
$\sum_{t \ge 0} a(t) = \sum_{t \ge 0} b(t) = +\infty,$ (22b)
$\sum_{t \ge 0} a(t)^2 < +\infty, \quad \lim_{t \to \infty} b(t) = 0,$ (22c)
$\lim_{t \to \infty} \frac{a(t)}{b(t)}\, e^{\kappa \sum_{s \le t} a(s)} = 0, \quad \kappa > 0.$ (22d)

Note that these conditions are satisfied for instance upon taking $a(t) = 1/(t \log(t))$ and $b(t) = t^{-2/3}$. Indeed, with this choice for $a(t)$, it is readily seen that

$\sum_{s=1}^{t} a(s) \sim \log(\log(t)),$

and (22b) follows. In addition, the quantity in (22d) then reads

$\frac{1}{t^{1/3} \log(t)}\, e^{\kappa (1 + o(1)) \log(\log(t))} \le \frac{(\log(t))^{2\kappa - 1}}{t^{1/3}},$

where we have used the upper bound of 1 on the term $o(1)$, and property (22d) readily follows.
We are now in a position to state this section's main result:
We are now in a position to state this section's main result: 

Theorem 3. Assume that the gains $a(t)$, $b(t)$ verify the conditions (22). Assume further that the noise terms are i.i.d. white Gaussian noise. Assume finally that the overlay graph over which peers communicate is connected. Then the distributed updating algorithm (16)-(21) verifies the following property. With probability 1, the columns of $X(t) := (X_u(t))_{u \in \mathcal{U}}$ converge to a collection of $L$ orthonormal vectors spanning the vector space associated with the $L$ largest eigenvalues of the weighted adjacency matrix $A$.

The proof of the theorem is given in Section F. In what follows, we provide some background and intuition for it.

Consider first the main equation, (16). If we ignore the denominator $Y_u(t)$, the noise term $\xi_u(t+1)$, and replace the term $N X_u(t) \Phi_u(t)$ by $\sum_v X_u(t) f_v(t)$, where $f(t)$ is as given in (19), this equation reads, written in matrix form:

$X(t+1) - X(t) = a(t) \left[ AX(t) - X(t)X'(t)AX(t) \right].$

This is in fact the celebrated Oja algorithm, proposed by Oja and Karhunen [14] to extract precisely the eigenvectors of the largest eigenvalues of $A$. Oja's algorithm is subject to some stability issues, which Borkar and Meyn [2] proposed to fix by scaling down the right-hand side of the previous equation by some factor $Z(t) = 1 + \sum_{u,k} X_{u,k}^2(t)$, and by adding an extra noise term $\xi(t+1)$ in the bracket in the right-hand side. Thus, the update rule they considered reads:

$X(t+1) - X(t) = \frac{a(t)}{Z(t)} \left[ AX(t) - X(t)X'(t)AX(t) + \xi(t+1) \right],$ (23)

and is proved in [2] to converge with probability 1 to the desired eigenvectors, under assumptions (22b), (22c) on the gains $a(t)$, and similar conditions on the noise $\xi(t)$ as in our theorem.

However, algorithm (23) does not lend itself to a distributed implementation, since neither of the two terms $X'(t)AX(t)$ or $Z(t)$ can be computed locally by the users.
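For reference, the centralised update (23) is easy to simulate. The following is a minimal sketch, assuming NumPy, a constant gain and a toy positive semidefinite diagonal matrix; the noise term is passed in explicitly and set to zero here so that the run is reproducible, which departs from the noisy scheme analysed by Borkar and Meyn. The function name `scaled_oja_step` is hypothetical.

```python
import numpy as np

def scaled_oja_step(X, A, gain, noise):
    """One step of (23): X += gain / Z * (A X - X X' A X + noise)."""
    Z = 1.0 + np.sum(X ** 2)          # Z(t) = 1 + sum of squared entries of X
    AX = A @ X
    return X + (gain / Z) * (AX - X @ (X.T @ AX) + noise)

# Hypothetical matrix: its top-2 eigenspace is spanned by e_1 and e_2.
A = np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
rng = np.random.default_rng(2)
X = 0.1 * rng.standard_normal((5, 2))
for t in range(5000):
    X = scaled_oja_step(X, A, gain=0.1, noise=np.zeros_like(X))

gram = X.T @ X                          # should approach the identity
off_subspace = np.linalg.norm(X[2:, :]) # mass outside span(e_1, e_2)
```

Note that both `Z` and `X.T @ AX` are global quantities: each requires a sum over all users, which is exactly what prevents a purely local implementation.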

To solve this issue, we introduce the auxiliary local variables $\Phi_u$, $\Psi_u$. The dynamics (18)-(20) according to which they evolve are best understood by setting to zero the input terms $f_u(t+1) - f_u(t)$ and $g_u(t+1) - g_u(t)$ in the right-hand side. It then becomes apparent that these dynamics perform local averaging (also known as gossiping). Thus these eventually converge to a state where all variables $\Phi_u(t)$ coincide with the average $(1/N) \sum_v \Phi_v(0)$ of the original entries.
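The averaging behaviour can be observed on a small example. The following is a minimal sketch, assuming NumPy, scalar variables, a ring of six nodes, a constant gain and unweighted differences over neighbours; the exact weighting of the gossip term is an assumption made for illustration, and the input terms are set to zero as in the discussion above.

```python
import numpy as np

def gossip_step(phi, adj, b):
    """One synchronous step: phi_u += b * sum over neighbours v of (phi_v - phi_u)."""
    degree = adj.sum(axis=1)
    return phi + b * (adj @ phi - degree * phi)

# Hypothetical overlay: a ring of 6 nodes (connected, symmetric adjacency).
N = 6
adj = np.zeros((N, N))
for u in range(N):
    adj[u, (u + 1) % N] = adj[(u + 1) % N, u] = 1.0

phi = np.arange(N, dtype=float)   # initial local values 0, 1, ..., 5
initial_average = phi.mean()
for _ in range(500):
    phi = gossip_step(phi, adj, b=0.25)
```

Because the adjacency matrix is symmetric, each step preserves the sum of the local values, so the common limit is the initial average.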

We can now provide a heuristic argument for the theorem. On a fast time scale, characterised by the gain parameters $b(t)$, the gossiping dynamics converge to almost constant vectors, with

$\Phi_u(t) \approx \frac{1}{N} \sum_{v} \Phi_v(t), \quad u \in \mathcal{U}.$

Then on a slower time scale dictated by the gain parameters $a(t)$, the variables of interest $X_u(t)$ follow dynamics very close to (23). Indeed, the auxiliary parameters $\Phi_u$, $\Psi_u$ track accurately the desired terms $X'(t)AX(t)$ and $Z(t)$ respectively.

A couple of remarks are in order. The stabilisation by the scaling factor $Z(t)$ in (23) seems insufficient in the presence of the additional dynamics (18)-(20). This leads us to introduce our alternative stabilisation via $Y_u(t)$ in (17). Also, in problems with dynamics at two time scales a common assumption on the gain parameters is that $a(t)/b(t) \to 0$. In the present case, a stronger form of time scale separation (namely, condition (22d)) is needed, to prevent reinforcing instabilities between the two dynamics.

4.2 Evaluation of an Asynchronous Version 

In this section we present numerical evaluations. We exhibit convergence of an asynchronous version of the distributed coordinate assignment scheme presented in the previous section on synthetic data generated according to the model presented in Section 3.1.

In Section 4.1 we showed that the distributed algorithm (16)-(18) converges almost surely towards $L$ linearly independent vectors spanning the vector space generated by the eigenvectors corresponding to the top $L$ magnitude eigenvalues of the adjacency matrix $A$.

In the following, we evaluate the asynchronous version of the algorithm. In this setting, each node $u$ keeps track of its own coordinates $X_u$ as well as the gossiped variables $\Phi_u$ and $\Psi_u$. However, instead of explicitly imposing a timescale separation via gains $a(t)$ and $b(t)$ while enforcing a synchronised evolution of all the quantities, we impose distinct rates at which the updates are performed. Namely, the coordinate updates (16) are performed independently according to Poisson processes of rate $\lambda$, while gossiping (17)-(18) is performed independently "pairwise" according to Poisson processes of rate $\mu \gg \lambda$. By pairwise we mean that a pair of nodes $(u, v) \in \mathcal{E}$ will exchange and update their values for $\Phi$ and $\Psi$ at rate $\mu$, similarly to randomised gossiping techniques.

Algorithm 1 Distributed profiling algorithm

Node(u)::Update-Local() at rate $\lambda$
Local Variables: $X_u$, $\Pi_u$, $w_u$, $X_u^0$, $\Pi_u^0$, $\Phi_u$, $\Psi_u$
1: if $w_u = 1$ then
2:   Retrieve partial product vectors $\Pi_v$ from $v \in \mathcal{N}_u$
3:   $X_u := X_u + \frac{\gamma}{Y_u} \left[ \sum_{v \sim u} A_{uv} \Pi_v - N X_u \Phi_u \right]$
4: else
5:   Retrieve vectors $X_v$ from $v \in \mathcal{N}_u$
6:   $\Pi_u := \sum_{v \sim u} A_{uv} X_v$
7: end if

Link(u,v)::Gossip() at rate $\mu$
1: Retrieve local variables at $u$ and $v$
2: for $(R, h)$ in $\{(\Phi, f^{(2)}), (\Psi, g)\}$ do
3:   $a := (R_u + R_v)/2$
4:   for $i$ in $\{u, v\}$ do
5:     $\delta_i := h(X_i, \Pi_i) - h(X_i^0, \Pi_i^0)$
6:     $R_i := a + \delta_i$
7:   end for
8: end for
9: $X_u^0 := X_u$, $\Pi_u^0 := \Pi_u$, $X_v^0 := X_v$, $\Pi_v^0 := \Pi_v$

Furthermore, we replace the adjacency matrix by its square $A^2$. We choose to do so since the latter is positive semidefinite, has the same eigenvectors as $A$, and a spectrum composed of the squared eigenvalues of $A$. Implicitly, the eigenvectors are ordered according to the magnitude of the eigenvalues of $A$, instead of their actual values. In turn, this modification alters the function $f_u$ from (19), which becomes:

$f_u^{(2)}(t) = \Pi'_u(t)\, \Pi_u(t), \quad \Pi_u(t) = \sum_{v \sim u} A_{uv} X_v(t).$ (24)

The algorithm executed at each node is summarised in Algorithm 1.

Since sparsity is not preserved by taking the square of $A$, we cannot simply use $A^2$ as a new adjacency matrix. Thus, we need to compute products $A^2 X$ specifically: for some node $u$, every other call to the Update-Local() procedure computes the partial products $\Pi_u = \sum_{v \sim u} A_{uv} X_v$ (State 6 in the Algorithm). Subsequently, the neighbours' partial product vectors $\Pi_v$ are used in the coordinate update procedure (16) at State 3 and for the gossiping of $f_u = \Pi'_u \Pi_u$. Each node additionally stores its previous vectors $X_u^0$ and $\Pi_u^0$ for use in the Gossip() procedure.

The only piece of global information required is the number of users in the system (or an approximation thereof). In the Update-Local() procedure, we used a fixed gain $\gamma$. The noise component $\xi$ is omitted in the algorithm. Even so, noise is intrinsic to the algorithm as it is introduced by both the gossip averaging and by the fact that exchanges are asynchronous.

Figure 2: Pure and noisy user profiles



For evaluation of convergence, we used a synthetic data trace generated according to our model from Section 3.1. This trace considers 2200 users clustered in 4 classes of sizes 200, 500, 600 and 900. The classes are characterised by a probability matrix $pB$.

We consider an adjacency matrix $A$ generated using the aforementioned parameters. This matrix is sparse (the average node degree is 40). For ease of visualisation, we consider the case $L = 2$ (i.e., the profile space is a 2-dimensional plane). In Figure 2 we plot the first two eigenvectors of the expected adjacency matrix $\bar{A}$. They are constant on each of the 4 classes, hence the plot consists of 4 points, which are depicted as the four black squares. They represent the pure user profiles which characterise each of the four classes. Additionally, we plot the first two eigenvectors of the "noisy" adjacency matrix $A$, with elements belonging to a specific class marked with the same symbol. We have the visual confirmation that despite the sparsity of matrix $A$, the noisy profiles are grouped around the pure profiles.

We choose a gain $\gamma = 0.001$, an update rate $\lambda = 0.2$ and a gossip rate $\mu = 10$. We initialise the algorithm at a random state. In Figure 3 we plot the time evolution of the proportion of the mass of the two coordinate vectors $X_{\cdot 1}$ and $X_{\cdot 2}$ (aggregated across users) that falls on the space orthogonal to the 2-dimensional eigenspace generated by the first two eigenvectors of matrix $A^2$. Additionally, we plot the scalar product of the two coordinate vectors. After roughly 400 time units, we observe convergence towards orthogonal vectors spanning the desired eigenspace.
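The two quantities tracked in this experiment can be computed as follows. This is a minimal sketch, assuming NumPy and a dense symmetric matrix; the function name `orthogonal_mass` and the toy inputs are hypothetical, and the reference eigenspace is obtained here by a centralised eigendecomposition, which is only meant for offline evaluation.

```python
import numpy as np

def orthogonal_mass(X, A, L):
    """Per column of X, the share of squared norm outside the top-L eigenspace of A^2."""
    w, V = np.linalg.eigh(A @ A)
    top = V[:, np.argsort(w)[::-1][:L]]     # eigenvectors of the L largest eigenvalues
    residual = X - top @ (top.T @ X)        # component orthogonal to the eigenspace
    return (residual ** 2).sum(axis=0) / (X ** 2).sum(axis=0)

# Toy check: the top-2 eigenspace of A^2 is spanned by e_1 and e_2.
A = np.diag([3.0, -2.0, 1.0, 0.5])
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
mass = orthogonal_mass(X, A, L=2)   # first column inside, second half outside
inner = X[:, 0] @ X[:, 1]           # scalar product of the two coordinate vectors
```

Both quantities tend to zero when the coordinate vectors converge to orthogonal vectors spanning the desired eigenspace.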

5 Conclusions 

In this paper we addressed the problem of distributed user profiling and recommendation. 

We first showed that spectral techniques constitute an appealing approach, and obtained novel results on their efficiency, thereby improving upon previous literature on the subject of spectral clustering. We showed that for a low-rank probabilistic model of user taste, a simple distributed algorithm based on local votes in the profile space asymptotically achieves accurate prediction of user preference.

Figure 3: Convergence of the asynchronous distributed algorithm



We developed techniques for computing eigenvectors in a distributed manner. Our solution combines 
ideas from Oja's algorithm with gossiping algorithms. From a theoretical standpoint, it essentially relies 
on a special form of multiple time scale stochastic approximation. The resulting technique may have 
other applications besides user profiling. 

Finally we evaluated our proposed methods on synthetic and actual data traces. We thereby validated our analysis in observing convergence to the desired eigenvectors. We could further show, based on the Netflix prize dataset, that accurate recommendations could be made at limited communication cost based on our spectral embedding.

Several research directions can be envisioned to take this work further. One intriguing problem con- 
cerns privacy. While our methods do not rely on direct exchange of sensitive private information, they 
may nevertheless lead to private information leakage. A distributed solution avoiding the issue is yet 
unavailable. 

Other directions concern the fine tuning of the methods. The issue of analytical selection of the number of eigenvectors has not been addressed here. The recent work of Shi et al. [13] could be an appealing solution.

A Proof of Lemma 1



Consider an eigenvalue $\lambda$, $|\lambda| > 0$, of $\bar{A}$ and a corresponding eigenvector $x$. For $1 \le u \le N$,

$(\bar{A}x)(u) = \frac{\omega}{N} \sum_{v=1}^{N} b_{k(u)k(v)}\, x(v) = \frac{\omega}{N} \sum_{\ell=1}^{K} b_{k(u)\ell} \sum_{v : k(v) = \ell} x(v) = \lambda x(u).$

For a large enough $N$, we have that $|C_k| > 1$ for all $k$. This is true, since the size of each class grows linearly with $N$. Then for all $k$ and for all $u, u' \in C_k$ with $u' \ne u$ it follows that $x(u') = x(u)$. Denote the value of $x(u)$ by $y(k(u))$. Then, for all $1 \le \ell \le K$,

$\sum_{\ell'=1}^{K} b_{\ell\ell'}\, \alpha_{\ell'}\, y(\ell') = \frac{\lambda}{\omega}\, y(\ell).$

Thus $y$ is an eigenvector of the $K \times K$ matrix $M = B\,\mathrm{diag}(\alpha)$ corresponding to its eigenvalue $\lambda/\omega$. Since $M$ is a constant matrix, its eigenvalues are also constants. Hence it must be that there exists a constant $c$ such that $\lambda = c\omega = \Theta(\omega)$. By Condition (2b) we conclude that the top $L$ magnitude eigenvalues of $\bar{A}$ have distinct absolute values.

Finally, we have that

$\|x\|_2 = \sqrt{N}\, \|y\|_\alpha.$

Thus, if we consider a vector $y$ normalised under the $\alpha$-norm, the components of the normalised eigenvector $\frac{x}{\|x\|}$ are given by (3). □

Note that a consequence of this result is that $\bar{A}$ can have a number of non-zero eigenvalues that is at least $L$ (by Condition (2a)) and at most $K$ (i.e., the maximum number of eigenvalues of matrix $M$).



B Proof of Lemma 2

We start by establishing a simple result of stochastic dominance.

By definition a random variable $X$ is dominated for the convex ordering by another random variable $Y$ (written as $X \le_{cx} Y$) if for any convex function $f : [c, d] \to \mathbb{R}$ such that $\mathbb{E}f(X)$ and $\mathbb{E}f(Y)$ exist, we have $\mathbb{E}f(X) \le \mathbb{E}f(Y)$.

It is known that (see for instance [11], Theorem 1.5.20): if $X \le_{cx} Y$, then there exist $\hat{X}$ and $\hat{Y}$ with the same distributions as the original variables, but which are such that $\mathbb{E}(\hat{Y} | \hat{X}) = \hat{X}$.

Another result found in [11] states that for some closed interval $[c, d]$, if $X : \Omega \to [c, d]$ and $Y : \Omega \to \{c, d\}$ are two random variables such that $\mathbb{E}X = \mathbb{E}Y$, then $X \le_{cx} Y$.

In this latter setting, we wish to establish a variant of the former result. Namely, for some closed interval $[c, d]$, if $X : \Omega \to [c, d]$, we wish to construct a random variable $\hat{Y} : \Omega \to \{c, d\}$ supported on the extremities $\{c, d\}$ of the interval, such that $\mathbb{E}X = \mathbb{E}\hat{Y}$ and $\mathbb{E}(\hat{Y} | X) = X$. To achieve this, pick a uniformly distributed random variable $U \sim \mathcal{U}[0, 1]$ independent of $X$. We define the random variable $\hat{Y} : \Omega \to \{c, d\}$ as $\hat{Y} = F(X, U)$, where

$F(x, u) = d - \mathbb{1}_{\{x < u(d - c) + c\}} (d - c).$

Then

$\mathbb{E}(\hat{Y} | X) = \mathbb{E}(F(X, U) | X) = d - \frac{d - X}{d - c}(d - c) = X.$

Since $\hat{Y}$ can possibly take only two values, we can compute the corresponding probabilities:

$\mathbb{P}(\hat{Y} = d) = \int_c^d P_X(dx) \int_0^{\frac{x - c}{d - c}} du = \frac{\mathbb{E}X - c}{d - c}$ and $\mathbb{P}(\hat{Y} = c) = \frac{d - \mathbb{E}X}{d - c},$ (25)

and hence $\mathbb{E}X = \mathbb{E}\hat{Y}$.
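The construction of the two-point variable can be checked by simulation. The following is a minimal sketch, assuming NumPy and arbitrary illustrative values for $c$, $d$ and $x$ (the function name `two_point` is hypothetical). For a fixed value $x$ of $X$, the empirical mean of $F(x, U)$ should be close to $x$.

```python
import numpy as np

def two_point(x, u, c, d):
    """F(x, u) = d - 1{x < u (d - c) + c} (d - c): sends x in [c, d] to c or d."""
    return d - (x < u * (d - c) + c) * (d - c)

rng = np.random.default_rng(4)
c, d, x = -0.2, 0.8, 0.1           # hypothetical values chosen for illustration
u = rng.random(200_000)            # U ~ Uniform[0, 1], independent of x
y = two_point(x, u, c, d)
empirical_mean = y.mean()          # estimates E(Y | X = x), which equals x
```

Every sample equals one of the two endpoints, and the endpoint $d$ is drawn with probability $(x - c)/(d - c)$, in agreement with (25).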

Let us now proceed with the proof of the Lemma.

Denote by $Q := A - \mathbb{E}A$. Denote further $p_{\max} = \max_{i,j} p_{ij}$ and $p_{\min} = \min_{i,j} p_{ij}$. Then the elements of $Q$ all belong to the interval $[-p_{\max}, 1 - p_{\min}]$. For a large enough $N$ such that $p_{\max} < 1$, there exist $\alpha = \frac{1 - p_{\min}}{1 - p_{\max}} \ge 1$ and $p = p_{\max} > 0$ such that $[-p_{\max}, 1 - p_{\min}] \subseteq \alpha[-p, 1 - p]$.

Consider a symmetric matrix $U$ of independent uniformly distributed random variables $\{U_{ij} = U_{ji} \sim \mathcal{U}[0, 1]\}_{i<j}$. Then for each entry of $Q$ such that $i < j$, which we regard as a random variable with values in the interval $\alpha[-p, 1 - p]$, we construct the random variable $Z_{ij} = F(Q_{ij}, U_{ij})$, which has the desired property, written in matrix form

$\mathbb{E}(Z | Q) = Q.$ (26)

Note that the entries of $Z$ are mutually independent by construction. Furthermore, the random variables defined as $\{Y_{ij} = \alpha^{-1} Z_{ij} + p,\ Y_{ji} = Y_{ij}\}_{i<j}$ are mutually independent Bernoulli random variables of parameter $p$ and hence form the adjacency matrix $Y$ of an Erdős-Rényi graph of parameters $(N, p)$. Denote $\bar{Y} := \mathbb{E}Y = p(ee' - I)$, where $e$ is the all-ones column vector and $e'$ denotes transposition.

Let us now prove that the spectral radius of $Z = \alpha(Y - \bar{Y})$ is upper bounded by $O(\sqrt{\omega})$ with high probability.

Since $\omega = \Omega(\log N)$, and the $\{Y_{ij}\}_{i<j}$ are mutually independent, we can apply the results from [5]. Let $y$ be any vector of norm 1 and denote $u := \frac{1}{\sqrt{N}} e$. We can decompose $y$ as follows: $y = ax + bu$, where $a^2 + b^2 = 1$, and $x$ is a vector of norm 1 orthogonal to $u$, $x \perp u$. Then

$|y'(Y - \bar{Y})y| \le 2\,|ab|\,T_1 + a^2\, T_2 + b^2\, T_3,$

where $T_1 := |x'(Y - \bar{Y})u|$, $T_2 := |x'(Y - \bar{Y})x|$ and $T_3 := |u'(Y - \bar{Y})u|$.

Denote by $\delta_i$ the degree of node $i$ and by $\bar{\delta}$ the average degree. According to Lemma 2.2 from [5] and taking into account the fact that $e$ is an eigenvector of $\bar{Y}$ we have

$T_1 = |x'Yu| \le 2\sqrt{\omega}$, with probability $1 - e^{-\Omega((N\omega)^{1/3})}$.

By Theorem 2.5 and Claim 2.4 from [5], we have that for every constant $c_1 > 0$, there exists another constant $c_2 > 0$ such that:

$|x'Yx| \le c_2\sqrt{\omega}$, with probability $1 - N^{-c_1}$. (27)

We will restrict ourselves to constants $c_1 > 1$ for reasons that will become apparent later in the proof. Thus, we can bound the second term with probability $1 - N^{-c_1}$:

$T_2 \le |x'Yx| + |x'\bar{Y}x| \le O(\sqrt{\omega}) + \left|\sum_i x_i \sum_{j \ne i} p\, x_j\right| = O(\sqrt{\omega}) + \left|-p \sum_i x_i^2\right| = O(\sqrt{\omega}) + \Theta\left(\frac{\omega}{N}\right).$

Finally, using a Chernoff bound we find

$T_3 = |\bar{\delta} - \omega| = O(\sqrt{\omega})$, with probability $1 - e^{-O(N)}$.

Thus, $\rho(Z) = O(\sqrt{\omega})$, with probability $1 - N^{-c_1}$. Furthermore, there exists a constant $a > 0$ such that $\rho(Z) \le aN$.

Let us now finally characterise the spectral radius of $Q$.

Using the fact that the spectral radius is a convex function and by Jensen's inequality, we get

$\rho(Q) = \rho(\mathbb{E}[Z | Q]) \le \mathbb{E}[\rho(Z) | Q].$

We have a random variable $R := \rho(Z)$ supported on $[0, aN]$, such that $\mathbb{P}(R \ge t) \le O(N^{-c_1})$ for $t = O(\sqrt{\omega})$, and we wish to deduce that the conditional expectation $S := \mathbb{E}(R | Q)$ is also upper bounded by $O(\sqrt{\omega})$ with high probability.

Since $R$ and $S$ have countable state spaces, it makes sense to consider

$\beta(s) := \mathbb{P}(R \ge t \,|\, S = s).$

Since $R : \Omega \to [0, aN]$, and since $\mathbb{E}(R | S) = S$, we have that $\beta(s)(aN) + (1 - \beta(s))t \ge s$, and thus $\beta(s) \ge (s - t)/(aN - t)$. Denoting $\gamma := \mathbb{P}(S \ge t + 1)$, we have

$\mathbb{P}(R \ge t) = \mathbb{E}(\beta(S)) \ge \mathbb{E}(\beta(S)\, \mathbb{1}_{\{S \ge t+1\}}) \ge \frac{\gamma}{aN - t}.$

Hence

$\gamma = \mathbb{P}(S \ge t + 1) \le (aN - t)\, \mathbb{P}(R \ge t) = (aN - t)\, O(N^{-c_1}) = o(1),$

since we considered $c_1 > 1$. □



C Proof of Lemma |3] 

We will show the two claims (5) and (6) by induction. Denote (5) by $\mathcal{P}_k$ and (6) by $\mathcal{Q}_k$.

We will begin by proving $\mathcal{P}_1$ and $\mathcal{Q}_1$. Since we make extensive use of Lemma 2, it is implied that all inequalities in this proof hold with high probability in the sense of Lemma 2 (that is, with the probability guaranteed by that lemma). All vectors are column vectors and we denote by $x'$ the transpose of vector $x$. Furthermore, for two vectors $x$ and $y$, by $x \perp y$ we mean that their scalar product $x'y$ equals zero.

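Using the variational characterisation of eigenvalues and Lemma 2 we get:

$$\big||\hat{\lambda}_1| - |\lambda_1|\big| \overset{\text{w.l.o.g.}}{=} |\hat{\lambda}_1| - |\lambda_1| \le |\hat{\lambda}_1| - |\hat{x}_1'\bar{A}\hat{x}_1| \le |\hat{x}_1'(A - \bar{A})\hat{x}_1| \le O(\sqrt{\omega}),$$

which proves $\mathcal{P}_1$.

We denote the first eigenvector of $A$ by $\hat{x}_1 = a_1 x_1 + b_1 y_1$, where $a_1^2 + b_1^2 = 1$, $x_1$ is the first eigenvector of $\bar{A}$ and $x_1 \perp y_1$. Then, by making use of $\mathcal{P}_1$, there exist positive constants $\theta_1$ and $\theta_2$ such that

$$|\lambda_1| - \theta_1\sqrt{\omega} \le |\hat{\lambda}_1| \le a_1^2\,|x_1'\bar{A}x_1| + b_1^2\,|y_1'\bar{A}y_1| + \theta_2\sqrt{\omega}.$$

We took into account the symmetry of $\bar{A}$ and the fact that $\bar{A}x_1 = \lambda_1 x_1$. By the Courant-Fischer theorem, we get the following inequality:

$$|\lambda_1| - \theta\sqrt{\omega} \le |\lambda_1| - b_1^2\,(|\lambda_1| - |\lambda_2|), \quad \theta > 0, \tag{28}$$

and since the top L eigenvalues of $\bar{A}$ are distinct (by Condition (2b)), it holds that $0 < |\lambda_1| - |\lambda_2| = \Theta(\omega)$. We get that $b_1^2 \le O(1/\sqrt{\omega})$, thus proving $\mathcal{Q}_1$.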
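The base case rests on two deterministic facts about symmetric matrices: the largest eigenvalue magnitudes of $A$ and $\bar{A}$ differ by at most $\rho(A - \bar{A})$, and the misalignment $b_1^2$ satisfies $b_1^2(|\lambda_1| - |\lambda_2|) \le 2\rho(A - \bar{A})$, which is inequality (28) with $\theta\sqrt{\omega}$ replaced by $2\rho(A - \bar{A})$. The sketch below checks both on random $2 \times 2$ symmetric matrices, where eigenpairs have a closed form. The matrix size, the entry ranges and the helper names are assumptions chosen only to keep the example self-contained.

```python
import math
import random

def sym2_eig(p, q, r):
    """Eigenpairs of [[p, q], [q, r]], sorted by decreasing magnitude."""
    mean, dev = (p + r) / 2.0, math.hypot((p - r) / 2.0, q)
    theta = 0.5 * math.atan2(2.0 * q, p - r)
    pairs = [
        (mean + dev, (math.cos(theta), math.sin(theta))),
        (mean - dev, (-math.sin(theta), math.cos(theta))),
    ]
    return sorted(pairs, key=lambda pair: -abs(pair[0]))

def worst_violation(trials=1000, seed=1):
    """Largest value of (left side - right side) over both inequalities."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        mean_entries = [rng.uniform(-10.0, 10.0) for _ in range(3)]  # A_bar
        noise = [rng.uniform(-1.0, 1.0) for _ in range(3)]           # A - A_bar
        (lam1, x1), (lam2, _) = sym2_eig(*mean_entries)
        (lam1_hat, x1_hat), _ = sym2_eig(*[m + e for m, e in zip(mean_entries, noise)])
        rho = abs(sym2_eig(*noise)[0][0])  # spectral radius of the perturbation
        a1 = x1[0] * x1_hat[0] + x1[1] * x1_hat[1]
        b1_sq = 1.0 - a1 * a1
        worst = max(
            worst,
            abs(abs(lam1_hat) - abs(lam1)) - rho,              # claim P_1
            b1_sq * (abs(lam1) - abs(lam2)) - 2.0 * rho,       # inequality (28)
        )
    return worst

print(worst_violation())
```

A non-positive return value (up to rounding) means neither inequality was violated in any trial.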

In order to generalise this result, we make use of the following simple lemma:

Lemma 6. Let $x_1$ and $\hat{x}_1$ be two non-orthogonal vectors of norm 1 such that

$$1 - (x_1'\hat{x}_1)^2 \le O\Big(\frac{1}{\sqrt{\omega}}\Big). \tag{29}$$

Then for all vectors $\hat{x}_2$ of norm 1 such that $\hat{x}_2 \perp \hat{x}_1$, $(\hat{x}_2'x_1)^2 \le O(1/\sqrt{\omega})$.

Proof. We have $\hat{x}_1 = a x_1 + b y_1$, where $x_1 \perp y_1$, $\|y_1\| = 1$ and $a^2 + b^2 = 1$. By hypothesis (29), $b^2 \le O(1/\sqrt{\omega})$. Thus,

$$\hat{x}_2'\hat{x}_1 = a\,\hat{x}_2'x_1 + b\,\hat{x}_2'y_1 = 0,$$

and thus, since $a \neq 0$,

$$(\hat{x}_2'x_1)^2 = \frac{b^2}{a^2}\,(\hat{x}_2'y_1)^2 \le \frac{b^2}{1 - b^2} \le \frac{\theta}{\sqrt{\omega}},$$

where $\theta > 0$ is a constant. □
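The bound in Lemma 6 follows from an identity that holds for any unit vectors, so it can be checked numerically without reference to $\omega$. The following sketch draws random unit vectors $x_1$, a nearby $\hat{x}_1$, and a unit $\hat{x}_2 \perp \hat{x}_1$, and verifies $(\hat{x}_2'x_1)^2 \le (1 - a^2)/a^2$ with $a = x_1'\hat{x}_1$. The dimension, the perturbation size `0.2` and the helper names are illustrative assumptions.

```python
import math
import random

def unit(v):
    """Scale a vector to Euclidean norm 1."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lemma6_slack(dim=20, trials=500, seed=2):
    """Largest value of (x2_hat' x1)^2 - b^2 / a^2 over random draws."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        x1 = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
        bump = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
        x1_hat = unit([p + 0.2 * q for p, q in zip(x1, bump)])  # close to x1
        raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        proj = dot(raw, x1_hat)
        x2_hat = unit([r - proj * h for r, h in zip(raw, x1_hat)])  # orthogonal to x1_hat
        a_sq = dot(x1, x1_hat) ** 2
        worst = max(worst, dot(x2_hat, x1) ** 2 - (1.0 - a_sq) / a_sq)
    return worst

print(lemma6_slack())
```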

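We proceed by complete induction. We showed $\mathcal{P}_1$ and $\mathcal{Q}_1$. Now assume $\mathcal{P}_\ell$ and $\mathcal{Q}_\ell$ are true for all $1 \le \ell < k$. We wish to show $\mathcal{P}_k$ and $\mathcal{Q}_k$.

Let us write $\hat{x}_k = \alpha x_k + \beta y + \gamma z$, where $y \in \mathrm{Span}\{x_{k+1}, x_{k+2}, \ldots\}$ and $z \in \mathrm{Span}\{x_1, \ldots, x_{k-1}\}$ and $\alpha^2 + \beta^2 + \gamma^2 = 1$. Lemma 6 and the induction hypothesis show that $\gamma^2 \le O(1/\sqrt{\omega})$. Since

$$|\hat{x}_k'\bar{A}\hat{x}_k| \le (1 - \gamma^2)|\lambda_k| + \gamma^2|\lambda_1| \le |\lambda_k| + O(\sqrt{\omega}),$$

we can conclude that

$$\big||\hat{\lambda}_k| - |\lambda_k|\big| \overset{\text{w.l.o.g.}}{=} |\hat{\lambda}_k| - |\lambda_k| \le |\hat{\lambda}_k| - |\hat{x}_k'\bar{A}\hat{x}_k| + O(\sqrt{\omega}) \le O(\sqrt{\omega}),$$

thus proving $\mathcal{P}_k$.

We have that

$$\begin{aligned}
|\hat{\lambda}_k| &\le |\hat{x}_k'\bar{A}\hat{x}_k| + |\hat{x}_k'(A - \bar{A})\hat{x}_k| \le \alpha^2|\lambda_k| + \beta^2|\lambda_{k+1}| + \gamma^2|\lambda_1| + O(\sqrt{\omega}) \\
&\le |\lambda_k| + \frac{a}{\sqrt{\omega}}\,(|\lambda_1| - |\lambda_k|) - \beta^2\,(|\lambda_k| - |\lambda_{k+1}|) + \theta\sqrt{\omega},
\end{aligned}$$

where $a$ and $\theta$ are positive constants. Without loss of generality we now assume that $\big||\hat{\lambda}_k| - |\lambda_k|\big| = |\lambda_k| - |\hat{\lambda}_k|$, and using $\mathcal{P}_k$ together with $|\lambda_1| = \Theta(\omega)$ we get

$$\beta^2 \le \frac{O(\sqrt{\omega})}{|\lambda_k| - |\lambda_{k+1}|} = O\Big(\frac{1}{\sqrt{\omega}}\Big),$$

so that $1 - \alpha^2 = \beta^2 + \gamma^2 \le O(1/\sqrt{\omega})$, thus proving $\mathcal{Q}_k$. □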

D Proof of Lemma 4

Consider a non-zero singular value of S, $\sigma > 0$, and corresponding left and right singular vectors x and y. Then we can write:

$$(S'x)_i = \sum_{u=1}^{N} S_{ui}\,x_u = \sigma y_i = \sigma f_{k'(i)}, \qquad (Sy)_u = \sum_{i=1}^{F} S_{ui}\,y_i = \sigma x_u = \sigma g_{k(u)},$$

since $S_{ui}$ depends only on the class of u, $k(u)$, and the class of i, $k'(i)$. Since y is a right-singular vector of S, it is also an eigenvector of $S'S$ corresponding to eigenvalue $\sigma^2$, and thus after some simplification we have that

$$\frac{\sigma^2}{\omega^2}\,g = R\,\mathrm{diag}(\beta)\,R'\,\mathrm{diag}(\alpha)\,g,$$

that is the K-dimensional vector g is an eigenvector of the constant matrix $R\,\mathrm{diag}(\beta)\,R'\,\mathrm{diag}(\alpha)$ corresponding to the eigenvalue $\sigma^2/\omega^2$. Since the eigenvalues of a constant matrix are constant, it follows that there exists a constant $\lambda$ such that $\sigma = \lambda\omega = \Theta(\omega)$. By Condition (7b), it follows that the largest L singular values of S are distinct. Moreover, matrix $R\,\mathrm{diag}(\beta)\,R'\,\mathrm{diag}(\alpha)$ can have at most K distinct non-zero eigenvalues and hence the same holds for the singular values of S.

Finally, we have that

$$\|x\|_2 = \sqrt{N}\,\|g\|_\alpha.$$

Thus, if we consider a vector g normalised under the $\alpha$-norm, the components of the normalised eigenvector $x/\|x\|_2$ are given by $g_{k(u)}/\sqrt{N}$. □
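The structure used in this proof (singular vectors that are constant on classes, and squared singular values proportional to eigenvalues of $R\,\mathrm{diag}(\beta)\,R'\,\mathrm{diag}(\alpha)$) can be checked on a small block matrix. The sketch below assumes $S_{ui} = c\,R_{k(u)k'(i)}$ for a hypothetical scale `c`, two user classes and two item classes; under that assumption $SS'x = c^2 N F \mu\, x$ when $x_u = g_{k(u)}$ and $g$ is an eigenvector for $\mu$. The class sizes and the matrix `r` are made-up values, and the scale of the paper's own model is not reproduced here.

```python
import math

def block_model_residuals(user_counts=(6, 4), item_counts=(3, 5), c=0.5):
    """Return (max |S S' x - sigma^2 x|, | ||x||^2 - N ||g||_alpha^2 |)."""
    r = [[1.0, -1.0], [0.2, 0.8]]  # hypothetical class-level matrix R
    n, f = sum(user_counts), sum(item_counts)
    alpha = [cnt / n for cnt in user_counts]  # user class proportions
    beta = [cnt / f for cnt in item_counts]   # item class proportions
    k_user = [k for k, cnt in enumerate(user_counts) for _ in range(cnt)]
    k_item = [k for k, cnt in enumerate(item_counts) for _ in range(cnt)]
    s = [[c * r[ku][ki] for ki in k_item] for ku in k_user]

    # M = R diag(beta) R' diag(alpha), a 2 x 2 matrix.
    m = [[sum(r[i][l] * beta[l] * r[j][l] for l in range(2)) * alpha[j]
          for j in range(2)] for i in range(2)]
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    mu = (tr + math.sqrt(tr * tr - 4.0 * det)) / 2.0  # largest eigenvalue of M
    g = [m[0][1], mu - m[0][0]]                        # solves (M - mu I) g = 0

    x = [g[k] for k in k_user]                         # x is constant on classes
    s_t_x = [sum(s[u][i] * x[u] for u in range(n)) for i in range(f)]
    s_s_t_x = [sum(s[u][i] * s_t_x[i] for i in range(f)) for u in range(n)]
    sigma_sq = c * c * n * f * mu
    eig_residual = max(abs(s_s_t_x[u] - sigma_sq * x[u]) for u in range(n))

    norm_gap = abs(sum(v * v for v in x) - n * sum(a * gk * gk for a, gk in zip(alpha, g)))
    return eig_residual, norm_gap

print(block_model_residuals())
```

Both returned quantities should vanish up to rounding, which mirrors the two facts used above: $x$ is an eigenvector of $SS'$, and $\|x\|_2 = \sqrt{N}\|g\|_\alpha$.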



E Proof of Lemma 5

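For $\bar{A}$ we denote by $\zeta_k^+$ the normalised eigenvector corresponding to the eigenvalue $\sigma_k$ and by $\zeta_k^-$ the normalised eigenvector corresponding to the eigenvalue $-\sigma_k$. We introduce similar notation for the eigenvectors of $A$, namely $\hat{\zeta}_k^+$ and $\hat{\zeta}_k^-$. Then it holds that

$$\zeta_k^+ = \frac{1}{\sqrt{2}}\begin{pmatrix} x_k \\ y_k \end{pmatrix}, \quad
\zeta_k^- = \frac{1}{\sqrt{2}}\begin{pmatrix} x_k \\ -y_k \end{pmatrix}, \quad
\hat{\zeta}_k^+ = \frac{1}{\sqrt{2}}\begin{pmatrix} \hat{x}_k \\ \hat{y}_k \end{pmatrix}, \quad
\hat{\zeta}_k^- = \frac{1}{\sqrt{2}}\begin{pmatrix} \hat{x}_k \\ -\hat{y}_k \end{pmatrix}.$$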
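Eigenvectors of the form $(x_k; \pm y_k)/\sqrt{2}$ with eigenvalues $\pm\sigma_k$ are what one obtains for the symmetric block matrix with $S$ and $S'$ in its off-diagonal blocks and zeros on the diagonal. The following sketch checks that fact on a small hand-built example. That the symmetrised matrix in this proof has exactly this block form is an assumption here, as are the particular singular vectors and singular values used.

```python
import math

def dilation_residual():
    """Max entry of |A zeta - (+/- sigma) zeta| for A = [[0, S], [S', 0]]."""
    inv2, inv3 = 1.0 / math.sqrt(2.0), 1.0 / math.sqrt(3.0)
    u = [[inv3, inv3, inv3], [inv2, -inv2, 0.0]]  # orthonormal left singular vectors
    v = [[inv2, inv2], [inv2, -inv2]]             # orthonormal right singular vectors
    sigma = [3.0, 1.0]
    n, f = 3, 2
    # S = sum_k sigma_k u_k v_k'
    s = [[sum(sigma[k] * u[k][i] * v[k][j] for k in range(2)) for j in range(f)]
         for i in range(n)]
    a = [[0.0] * (n + f) for _ in range(n + f)]
    for i in range(n):
        for j in range(f):
            a[i][n + j] = s[i][j]
            a[n + j][i] = s[i][j]
    worst = 0.0
    for k in range(2):
        for sign in (1.0, -1.0):
            zeta = [c * inv2 for c in u[k]] + [sign * c * inv2 for c in v[k]]
            a_zeta = [sum(a[row][col] * zeta[col] for col in range(n + f))
                      for row in range(n + f)]
            worst = max(worst, max(abs(a_zeta[row] - sign * sigma[k] * zeta[row])
                                   for row in range(n + f)))
    return worst

print(dilation_residual())
```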



By Condition (7b), we can apply a slightly modified Lemma 3 to matrix $A$. The only modification we need to make is to change the considered ordering of the eigenvalues: instead of ordering them by largest magnitude, we order them decreasingly by value. Since we can apply Lemma 2 to $A - \bar{A}$, it is straightforward to see that the proof also holds in this setting.

We have thus that (10) holds and that with high probability

$$\sin(\zeta_k^+, \hat{\zeta}_k^+) \le O(\omega^{-1/4}).$$



To see that (11) holds as well, we write

i-(C,c, + ) 2 = i 



-\ 2 



{(x k ,x k ) 
((x k ,x k ) 



(yk,yk)) 2 
(yk,Vk}) 2 



< 



< 



0. 



By summing the two expressions we get that:

$$\frac{1}{2}\big(1 - \langle x_k, \hat{x}_k\rangle^2\big) + \frac{1}{2}\big(1 - \langle y_k, \hat{y}_k\rangle^2\big) \le O(\omega^{-1/2}),$$

hence the conclusion. □



F Proof of Theorem 3

The main ingredient in the proof is the following

Lemma 7. The update equation (16) is such that for all $t > 0$,

$$\|X(t+1)\| \le e^{K\sum_{s=1}^{t} a(s)}\,\big(\|X(1)\| + M\big) \tag{30}$$

for some positive constant K and some almost surely finite random variable M.

Proof. Rewrite Equation (16) in matrix form as

$$X(t+1) - X(t) = a(t)\big[F((X, \Phi, \Psi)(t)) + N D^{-1}(t)\,\xi(t+1)\big]$$

for some suitable function F, and where $D(t)$ denotes the $N \times N$ diagonal matrix with diagonal entries $Y_u(t)$. By the specific form of the terms $Y_u(t)$, and their role in the function F, it is readily seen that the latter verifies

$$\|F(X, \Phi, \Psi)\| \le K_1\|X\|$$

for some suitable constant $K_1$. This readily implies that

$$\|X(t+1)\| \le (1 + K_1 a(t))\,\|X(t)\| + a(t)\,\eta(t),$$

where the $\eta(t) = N\|\xi(t+1)\|$ are iid. By induction, one then establishes that

$$\|X(t+1)\| \le \prod_{s=1}^{t}(1 + K_1 a(s))\,\Big[\|X(1)\| + \sum_{s=1}^{t} a(s)\,\eta(s)\Big].$$

Denote now $\bar{\eta}$ the expectation of $\eta(s)$, and let $M(t) := \sum_{s=1}^{t} a(s)[\eta(s) - \bar{\eta}]$. It is readily seen that $M(t)$ is a uniformly integrable martingale, and hence the supremum $\sup_{s>0} M(s)$ is almost surely finite; denote it by $\tilde{M}$. It then follows from the above equation that

$$\|X(t+1)\| \le \prod_{s=1}^{t}(1 + K_1 a(s))\,\Big[\|X(1)\| + \tilde{M} + \bar{\eta}\sum_{s=1}^{t} a(s)\Big].$$

Using the elementary inequality $1 + x \le e^x$, one deduces:

$$\|X(t+1)\| \le e^{K_1\sum_{s=1}^{t} a(s)}\,\Big[\|X(1)\| + \tilde{M} + \bar{\eta}\sum_{s=1}^{t} a(s)\Big].$$

The result (30) then follows by setting $K = K_1 + \bar{\eta}$, and $M = 1 + \max(0, \tilde{M})$. □
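The induction step and the passage from the product to the exponential in the proof of Lemma 7 can be checked on a scalar stand-in for $\|X(t)\|$. The sketch below runs the recursion with equality, which is the worst case allowed by the inequality, and compares it with the product bound and the exponential bound at every step. The gain sequence $a(s) = 1/s$, the exponential noise and the constant `k1` are assumptions for illustration and are not the paper's parameters.

```python
import math
import random

def growth_bounds_hold(steps=300, k1=0.7, x1=2.0, seed=3):
    """Check x(t+1) <= prod(1 + k1 a(s)) [x1 + sum a(s) eta(s)] <= exp(k1 sum a(s)) [...]."""
    rng = random.Random(seed)
    x, prod, gain_sum, noise_sum = x1, 1.0, 0.0, 0.0
    for s in range(1, steps + 1):
        a = 1.0 / s                    # assumed gain sequence
        eta = rng.expovariate(1.0)     # iid non-negative noise magnitude
        x = (1.0 + k1 * a) * x + a * eta
        prod *= 1.0 + k1 * a
        gain_sum += a
        noise_sum += a * eta
        product_bound = prod * (x1 + noise_sum)
        exponential_bound = math.exp(k1 * gain_sum) * (x1 + noise_sum)
        tolerance = 1.0 + 1e-12        # guards against floating-point rounding
        if x > product_bound * tolerance or product_bound > exponential_bound * tolerance:
            return False
    return True

print(growth_bounds_hold())
```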

This Lemma is now used to establish the following result:

Lemma 8. The auxiliary variables $\Phi_u(t)$, $\Psi_u(t)$ verify

$$\lim_{t\to\infty}\Big|N\Phi_u(t) - \sum_v f_v(t)\Big| = 0, \qquad \lim_{t\to\infty}\Big|N\Psi_u(t) - \sum_v g_v(t)\Big| = 0, \tag{31}$$

i.e. asymptotically, these quantities do track accurately their intended targets.



Proof. We shall only consider the case of $\Psi_u(t)$, the other one being entirely similar. Rewrite the update rule (18) in matrix form as

$$\Psi(t+1) = (I - b(t)\Delta)\,\Psi(t) + g(t+1) - g(t), \tag{32}$$

where $\Delta$ is the so-called Laplacian matrix of the overlay graph: $\Delta_{uu} = |\mathcal{N}_u|$, $\Delta_{uv} = -\mathbf{1}_{v \in \mathcal{N}_u}$ for $u \neq v$. Recall that the Laplacian $\Delta$ is positive semi-definite, with eigenvector $e = (1, \ldots, 1)'$ associated to the eigenvalue 0. Also, when the overlay graph is connected, all other eigenvectors are associated with strictly positive eigenvalues $\lambda > 0$.
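A minimal simulation of the recursion $\Psi(t+1) = (I - b(t)\Delta)\Psi(t) + g(t+1) - g(t)$ illustrates the tracking behaviour that the proof establishes. The sketch assumes a ring overlay of six nodes, gains $b(t) = 0.4\,t^{-0.6}$, a target sequence $g(t)$ that converges at rate $1/t$, and the initialisation $\Psi(1) = g(1)$; all of these are hypothetical choices made so that the increments of $g$ are negligible compared to $b(t)$.

```python
import random

def track_sum(n=6, steps=5000, seed=0):
    """Run the Laplacian recursion on a ring and return (psi, sum of final g)."""
    rng = random.Random(seed)
    g_limit = [rng.gauss(0.0, 1.0) for _ in range(n)]
    drift = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def g(t):
        # Target values per node; increments decay like 1 / t^2.
        return [limit + d / t for limit, d in zip(g_limit, drift)]

    psi = g(1)          # initialisation so that e' psi equals sum_u g_u(1)
    current = g(1)
    for t in range(1, steps + 1):
        b = 0.4 / t ** 0.6
        nxt = g(t + 1)
        # (Laplacian psi)_u = 2 psi_u - psi_{u-1} - psi_{u+1} on a ring.
        psi = [
            psi[u] - b * (2.0 * psi[u] - psi[(u - 1) % n] - psi[(u + 1) % n])
            + nxt[u] - current[u]
            for u in range(n)
        ]
        current = nxt
    return psi, sum(current)

psi, target = track_sum()
print(max(abs(len(psi) * value - target) for value in psi))
```

Because the columns of the Laplacian sum to zero, the total $e'\Psi(t)$ stays equal to $\sum_u g_u(t)$ at every step, while the components orthogonal to $e$ decay; the printed gap $|N\Psi_u - \sum_v g_v|$ should therefore be small.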

Note then that, because $g(t+1)$ and $g(t)$ have the same quadratic dependence on $X(t+1)$ and $X(t)$ respectively, it follows from Lemma 7 and the bound on function F therein that

$$\|g(t+1) - g(t)\| \le a(t)\,K' e^{2K\sum_{s=1}^{t} a(s)}$$

for some suitable fixed finite term $K'$. By our Condition (22d) this is asymptotically negligible compared to $b(t)$. Let z be an eigenvector of the Laplacian $\Delta$ associated with a positive eigenvalue $\lambda$. Denote $z(t)$ the scalar product $z'\Psi(t)$. One then deduces from (32) the equation

$$z(t+1) = (1 - \lambda b(t))\,z(t) + b(t)\,\epsilon(t),$$

where $\lim_{t\to\infty}\epsilon(t) = 0$. By induction, one can deduce from the previous expression the identity

$$z(t+1) = z(1)\prod_{s=1}^{t}(1 - \lambda b(s)) + \sum_{s=1}^{t}\epsilon(s)\,b(s)\prod_{\sigma=s+1}^{t}(1 - \lambda b(\sigma)).$$

Elementary analysis can then be used to deduce from this last display, assumptions (22b), (22c) on gains $b(t)$, and convergence of $\epsilon(t)$ to 0, that $z(t)$ also converges to 0.

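Indeed, take any fixed $\varepsilon > 0$. Since $\epsilon(t) \to 0$, there exists some $t_0$ such that $\epsilon(t) \le \varepsilon\lambda/3$ for all $t > t_0$. Then, we can write:

$$\begin{aligned}
z(t+1) \le{}& z(1)\prod_{s=1}^{t}(1 - \lambda b(s)) && (33)\\
&+ \sum_{s=1}^{t_0}\epsilon(s)\,b(s)\prod_{\sigma=s+1}^{t}(1 - \lambda b(\sigma)) && (34)\\
&+ \frac{\varepsilon\lambda}{3}\sum_{s=t_0+1}^{t} b(s)\prod_{\sigma=s+1}^{t}(1 - \lambda b(\sigma)). && (35)
\end{aligned}$$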



Term (33) develops as

$$z(1)\prod_{s=1}^{t}(1 - \lambda b(s)) = z(1)\,e^{\sum_{s=1}^{t}\log(1 - \lambda b(s))} \le z(1)\,e^{-\lambda\sum_{s=1}^{t} b(s)} \to 0,$$

by assumption (22c). Term (34) consists of $t_0$ terms (a finite number), each of which converges to 0 by the same argument as above. Hence, there exists $t_1$ such that for all $t > t_1$, we have

$$z(1)\prod_{s=1}^{t}(1 - \lambda b(s)) + \sum_{s=1}^{t_0}\epsilon(s)\,b(s)\prod_{\sigma=s+1}^{t}(1 - \lambda b(\sigma)) \le \frac{2\varepsilon}{3}.$$

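It can be shown that term (35) can be written as:

$$\frac{\varepsilon\lambda}{3}\sum_{s=t_0+1}^{t} b(s)\prod_{\sigma=s+1}^{t}(1 - \lambda b(\sigma)) = \frac{\varepsilon}{3}\Big(1 - \prod_{s=t_0+1}^{t}(1 - \lambda b(s))\Big) \le \frac{\varepsilon}{3}.$$

We have shown that for all $\varepsilon > 0$, there exists some $t_2 = t_0 \vee t_1$ such that for all $t > t_2$, we have

$$z(t+1) \le \varepsilon.$$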

We thus obtain that when decomposing vector $\Psi(t)$ according to the eigenbasis of the Laplacian matrix $\Delta$, one finds vanishing coordinates except along eigenvector e. Since the scalar product $e'\Psi(t)$ is always equal to $\sum_u g_u(t)$, as follows from (32), the announced result follows. □

To conclude the proof of the theorem, note now that, by the previous lemma, and our specific choice of gain parameters $Y_u(t)$ in (17), for large enough t, Equation (16) reads in vector form

X(t + 1) - X(t) = - a( l\ m A AX(t) - X{t)X'{t)AX(t) 

+ o(\\X(t)\\)+a(t + l)}. (36) 

As is readily seen, this coincides with the update rule (23), except for the $o(\cdot)$ terms. The analysis of [2] for establishing convergence of (23) also applies in fact to its perturbed version (36), and Theorem 3 follows. □